BARD, ChatGPT, and Watson in Jeopardy!: A Question-Answering Showdown

In a paper published in the journal AI Magazine, researchers compared the performance of Google's BARD, OpenAI's Chat Generative Pre-trained Transformer (ChatGPT), and IBM's Watson by analyzing their responses to Jeopardy! questions. When presented with questions that Watson had answered with high confidence, all three systems exhibited comparable levels of accuracy.

Study: BARD, ChatGPT, and Watson in Jeopardy!: A Question-Answering Showdown. Image credit: Teerapong mahawan/Shutterstock

The performance of both BARD and ChatGPT was noteworthy: they answered at a level on par with human experts, and their sets of correct responses overlapped strongly, as measured by a Tanimoto similarity score. However, the study revealed that both BARD and ChatGPT could produce different, conflicting answers when presented with the same Jeopardy! category and question multiple times. The paper also examines the characteristics of the questions that lead to these discrepancies and discusses the implications and challenges that this lack of answer reproducibility poses for testing such systems.
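The paper's exact computation is not reproduced in this article, but for two sets of correctly answered questions the Tanimoto (Jaccard) coefficient is conventionally the size of the intersection divided by the size of the union. A minimal sketch in Python, with hypothetical question IDs:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # two empty sets are trivially identical
    return len(a & b) / len(a | b)

# Hypothetical example: IDs of questions each system answered correctly.
bard_correct = {"q1", "q2", "q3", "q5"}
chatgpt_correct = {"q1", "q2", "q4", "q5"}

print(tanimoto(bard_correct, chatgpt_correct))  # 3 shared / 5 total = 0.6
```

A score near 1.0 would mean the two systems succeed on almost exactly the same questions, which is what the reported high similarity suggests.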

Background

Drawing upon insights from prior research, the authors examine three key question-answering systems: Watson, ChatGPT, and BARD. The examination takes place within the framework of the renowned quiz show Jeopardy!, a long-running program in which contestants tackle questions of varying difficulty and dollar value, grouped into categories. IBM's Watson was created specifically for Jeopardy! and boasts a diverse array of models and decision-making capabilities. ChatGPT, developed by OpenAI, excels at responding to information queries due to its neural network-based training. Meanwhile, Google's BARD harnesses the Language Model for Dialogue Applications (LaMDA) and integrates up-to-date data sources into its framework.

The authors selected Watson as the baseline because of its Jeopardy! success and well-documented development; Google's BARD and OpenAI's ChatGPT were chosen for comparison given the widespread use of chatbot applications. Questions were input into BARD and ChatGPT in both "long form" and "short form," and the responses were assessed against the recorded Jeopardy! Challenge videos. This process highlighted the systems' potential to change their answers, and posing the same questions multiple times revealed further nuances in their performance.
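The study's exact prompt wordings are not given in this article; the snippet below is only a hypothetical illustration of what a "long form" prompt (full game framing) versus a "short form" prompt (bare category and clue) might look like:

```python
# Hypothetical prompt templates; the study's actual "long form" and
# "short form" phrasings may differ.
CATEGORY = "WORLD CITIES"
CLUE = "New Zealand's second-largest city"

long_form = (
    f"Let's play Jeopardy! The category is '{CATEGORY}'. "
    f"The clue is: '{CLUE}'. What is your response?"
)
short_form = f"{CATEGORY}: {CLUE}"
```

The distinction matters because, as the findings below note, expressing the intent to play Jeopardy! measurably changes the systems' responses.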

Findings of the Study

The findings cover several aspects of the performance and behavior of Watson, BARD, and ChatGPT in Jeopardy! question answering. Both BARD and ChatGPT demonstrate higher accuracy than Watson on Jeopardy! questions, with no statistically significant difference in performance between the two. Expressing the intent to play Jeopardy! has a significant impact on responses, an influence that is particularly notable for ChatGPT.

Previously unseen Jeopardy! questions from 2023 yielded results similar to those for the Jeopardy! Challenge questions. Dollar value did not correlate with question difficulty for any of the systems. Watson's confidence factor significantly predicts its correctness, while the corresponding influence varies in significance for BARD and ChatGPT; text analysis failed to identify clear confidence indicators in the language models' responses.
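The paper's statistical tests are not reproduced here, but one standard way to check whether dollar value tracks correctness is a point-biserial correlation between a continuous variable and a binary outcome. A sketch with made-up data:

```python
from scipy import stats

# Hypothetical data: clue dollar values paired with whether the
# system answered correctly (1 = right, 0 = wrong).
dollar_values = [200, 400, 600, 800, 1000, 200, 400, 600, 800, 1000]
correct = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]

# Point-biserial correlation: an r near 0 with a large p-value would
# match the reported finding of no relationship.
r, p = stats.pointbiserialr(correct, dollar_values)
print(f"r = {r:.2f}, p = {p:.3f}")
```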

What Constitutes 'Problematic' Questions for ChatGPT?

Certain Jeopardy! questions prove difficult for ChatGPT and lead it to divergent answers. Each such question is paired with its Jeopardy! category and clue, and the system's responses are presented in multiple variations per question. These instances underscore the system's difficulty in disambiguating queries and the significant impact of context, emphasizing the need for improved handling of intricate natural language queries to understand user intent.

Challenging Queries for BARD

Similar to ChatGPT, BARD exhibits variability in its responses to the same questions on different occasions. For instance, a question about the European Capital of Culture in 2010 elicited both Istanbul and Pécs as answers, and a question about New Zealand's second-largest city produced both Hamilton and Christchurch.

Additionally, a query about a mosquito-borne joint illness yielded both chikungunya and dengue, highlighting the system's potential difficulty in disambiguating similar concepts. These examples emphasize the need for enhanced handling of complex queries and context in natural language understanding systems.

What Causes Multiple Conflicting Answers?

Both BARD and ChatGPT exhibit non-deterministic behavior, in some cases providing different answers to the same inputs. Much of this variability can be attributed to difficulty in accurately predicting user intent: when faced with an ambiguous query, the systems often resort to guessing instead of seeking clarification. Additionally, multiple similarly ranked candidate answers further complicate decision-making during query processing, contributing to the divergence in responses.
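The paper does not define a specific variability metric; one simple, illustrative way to quantify this non-determinism is a self-consistency rate: ask the same question several times and measure how often the modal answer appears. A sketch, where query_model stands in for any hypothetical chatbot call:

```python
import random
from collections import Counter

def self_consistency(query_model, question: str, trials: int = 10) -> float:
    """Ask the same question `trials` times and return the fraction
    of runs that produced the most common (modal) answer."""
    answers = [query_model(question) for _ in range(trials)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_count / trials

# Stub mimicking the divergent BARD answers reported in the study.
def fake_model(question: str) -> str:
    return random.choice(["Istanbul", "Pécs"])

print(self_consistency(fake_model, "European Capital of Culture, 2010"))
# A fully deterministic system would score 1.0; this stub scores well below it.
```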

Emerging Challenges in Evaluating Large Language Systems

Evaluating BARD and ChatGPT on the Jeopardy! Challenge questions revealed testing challenges. The systems' ability to provide multiple, conflicting responses to identical inputs raises concerns about reproducibility and system validation. To address this variability, an alternative approach is to treat system outputs as distributions, facilitating sensitivity analysis and probabilistic testing.
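In practice, treating outputs as distributions could look like the following sketch: sample each question repeatedly, record the empirical answer distribution, and report the probability of producing the gold answer with a confidence interval rather than a single pass/fail verdict. All names here are illustrative, not from the paper:

```python
import math
from collections import Counter

def probabilistic_accuracy(query_model, question: str, gold: str, trials: int = 20):
    """Sample the model repeatedly and return (answer distribution,
    empirical accuracy, 95% normal-approximation confidence interval)."""
    answers = [query_model(question) for _ in range(trials)]
    dist = Counter(answers)
    p = dist[gold] / trials
    half_width = 1.96 * math.sqrt(p * (1 - p) / trials)
    return dist, p, (max(0.0, p - half_width), min(1.0, p + half_width))
```

Scoring a system this way makes regressions visible: a drop from, say, a 0.9 to a 0.6 probability of the gold answer on the same clue is detectable even though either run might look "correct" on a single trial.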

Democratizing validation through broad user access presents one potential solution to the challenges of system testing, though its implementation requires meticulous oversight given the complex social and ethical implications. A penetration-study perspective, which concentrates on behavioral and social aspects rather than security vulnerabilities, offers a novel approach to evaluating the reliability and consistency of large language models.

Conclusion

To sum up, BARD and ChatGPT demonstrate expert-level performance in answering Jeopardy! questions, and the Tanimoto similarity measure indicates high consistency between their correct answers. However, variations in a query can yield different responses, which underscores the need for robust testing methods, including probabilistic accuracy rates and the Tanimoto similarity index, to assess improvements over time. Reproducibility and a penetration-study approach are likewise key considerations for testing these systems.


Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

