ChatGPT, Bard, and Claude Tackle Neurophysiology Questions

In a paper published in the journal Scientific Reports, researchers investigated the proficiency of large language models (LLMs) such as OpenAI's Chat Generative Pre-trained Transformer (ChatGPT), Google's Bard, and Anthropic's Claude in tackling neurophysiology questions.

Scores of the LLMs on English questions. Image credit: https://www.nature.com/articles/s41598-024-60405-y

Twenty questions spanning different topics and cognitive levels were presented to the models in English and Persian (Farsi), with physiologists scoring their responses on a scale of 0 to 5. Despite facing challenges in integrative topics, the LLMs demonstrated commendable overall performance, achieving a moderate mean score of 3.87 out of 5. The study's findings shed light on the strengths and limitations of LLMs in neurophysiology and emphasize the importance of targeted training to enhance their capabilities.

Related Work

Previous works have extensively explored the burgeoning field of conversational artificial intelligence (AI) and the proliferation of advanced language models like ChatGPT, Google's Bard, and Anthropic's Claude. These models have found utility across diverse domains, from generating human-like responses to aiding in professional tasks such as drafting research proposals and writing code.

Evaluating these models has become increasingly crucial, with studies focusing on their performance in various disciplines like gastroenterology, pathology, neurology, and physiology. Although some studies have evaluated LLMs' performance in multiple-choice question formats, particularly in neurology board-style exams, there's still a need for a more thorough assessment tailored to neurophysiology.

Neurophysiology Question Assessment

AI-driven chat applications, including ChatGPT, Claude, and Bard, were utilized to assess their efficacy in answering neurophysiology questions. A total of 20 questions covering four neurophysiology topics, namely general, sensory, motor, and integrative systems, were selected. These questions encompassed true/false, multiple-choice, and essay formats, allowing for a scoring range of 0–5 points for the responses. Categorization based on cognitive skills into lower-order and higher-order categories enabled a comprehensive evaluation of the models' capabilities.

A panel of three physiologists with expertise in neurophysiology ensured the validity of the questions and of the evaluation process. Data collection involved prompting the then-latest versions of ChatGPT 3.5, Claude 2, and Bard with the questions in Persian and English. The team employed prompt engineering strategies such as chain-of-thought (CoT) prompting and structured prompting to improve the quality of the models' responses.
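The paper does not publish its exact prompts, so the following is only a minimal sketch of how a chain-of-thought prompt of this kind might be issued; it assumes the OpenAI Python SDK, and the model name, system message, and example question are illustrative choices rather than the study's own.

```python
# Minimal sketch of chain-of-thought (CoT) prompting, assuming the
# OpenAI Python SDK; the study's exact prompts are not published,
# so the wording below is illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical neurophysiology exam question (not from the study).
question = (
    "True or false: The primary motor cortex is located in the "
    "precentral gyrus. Explain your answer."
)

# CoT prompting asks the model to reason step by step before answering.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are answering a neurophysiology exam question."},
        {"role": "user",
         "content": f"{question}\n\nLet's think step by step."},
    ],
)

print(response.choices[0].message.content)
```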

The physiologists evaluated the answers provided by the LLMs, scoring each question on a scale of zero to five points, with five indicating a full and comprehensive response. All data, including the questions, the answers generated by the LLMs, and the physiologists' scores, were recorded for further analysis. Statistical analysis assessed variations in scores between the Persian and English languages and across different topics and cognitive skill levels. The mean, median, and standard deviation provided an overview of the data.

Meanwhile, the Friedman test, the Kruskal–Wallis test, and the Wilcoxon signed-rank test were used to ascertain statistical significance. The level of agreement among the physiologists' scores was evaluated using the intraclass correlation coefficient (ICC). Statistical Package for the Social Sciences (SPSS) software was employed for all statistical analyses, with the significance level set at p < 0.05.
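The authors used SPSS, but for illustration the same nonparametric tests can be run in Python with scipy; the sketch below uses hypothetical 0–5 score arrays, not the study's actual per-question data.

```python
# Illustrative reproduction of the study's nonparametric tests using
# scipy (the authors used SPSS); all score arrays below are hypothetical.
import numpy as np
from scipy import stats

# Hypothetical 0-5 scores for the same 20 questions from each model.
chatgpt = np.array([5, 4, 3, 5, 4, 2, 5, 4, 3, 5, 4, 5, 3, 4, 5, 2, 4, 5, 3, 4])
bard    = np.array([4, 4, 2, 5, 3, 2, 4, 4, 3, 4, 3, 5, 2, 4, 4, 2, 3, 5, 3, 4])
claude  = np.array([5, 3, 3, 4, 4, 1, 5, 4, 2, 5, 4, 4, 3, 3, 5, 2, 4, 4, 3, 5])

# Friedman test: do the three models' scores on the same questions differ?
print(stats.friedmanchisquare(chatgpt, bard, claude))

# Kruskal-Wallis test: do scores differ across independent topic groups?
print(stats.kruskal(chatgpt[:5], chatgpt[5:10], chatgpt[10:15], chatgpt[15:]))

# Wilcoxon signed-rank test: paired comparison, e.g., English vs. Persian.
persian = np.array([4, 4, 2, 5, 3, 3, 5, 3, 3, 4, 4, 5, 2, 4, 4, 3, 4, 5, 2, 4])
print(stats.wilcoxon(chatgpt, persian))
```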

Language Model Evaluation

The study gathered responses from three prominent language models: ChatGPT, Bard, and Claude. Each question was assessed by a panel of three experienced physiologists, ensuring robust evaluation. The study design simulated an exam scenario, with each question posed to the models only once. This approach aimed to replicate real-world conditions where ambiguity or lack of understanding could impact the models' responses. The results were recorded and analyzed to gauge the reliability of the language models' outputs.

The evaluation process demonstrated strong agreement among the physiologists in scoring, indicating the consistency and reliability of their assessments. This agreement was quantified using the intraclass correlation coefficient (ICC), with values ranging from 0.935 to 0.993 across various topics. The high ICC value for all questions collectively (0.978) underscored the reliability of expert opinions. Such robust interrater agreement laid a solid foundation for subsequent analyses of the language models' performance.
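As an illustration of how such agreement figures are computed, the sketch below derives an ICC with the pingouin Python library on hypothetical rater data; both the library choice and the scores are assumptions, not the study's.

```python
# Minimal sketch of an intraclass correlation coefficient (ICC) check,
# using the pingouin library on hypothetical rater data; the paper
# reports using SPSS for its analyses.
import pandas as pd
import pingouin as pg

# Three physiologists each score the same five answers on a 0-5 scale.
df = pd.DataFrame({
    "answer": [1, 2, 3, 4, 5] * 3,
    "rater":  ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "score":  [5, 4, 3, 5, 2,  5, 4, 3, 4, 2,  4, 4, 3, 5, 2],
})

icc = pg.intraclass_corr(data=df, targets="answer", raters="rater",
                         ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```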

The language models, including ChatGPT, Bard, and Claude, performed satisfactorily on the neurophysiology questions, with a mean score of 3.87 ± 1.7. Variations between the Persian and English responses were not statistically significant, nor were differences across cognitive skill levels.

Notably, the motor system topic scored highest and the integrative topic lowest. Further analysis uncovered inconsistencies and inaccuracies, underscoring the need for ongoing refinement, particularly in specialized domains like neurophysiology. These insights inform efforts to improve the language models' reliability and performance.

Study Findings

In summary, this study assessed the LLMs' proficiency in neurophysiology and identified strengths and weaknesses. While ChatGPT, Bard, and Claude handled fundamental concepts well, they struggled with complex reasoning and with integrating knowledge across topics. They excelled on general neurophysiology and motor system questions but faltered on integrative questions.

Despite no significant differences based on language or cognitive level, inconsistencies and reliance on memorization were noted. The study recommends tailored training and reliable sources to improve LLMs' performance in neurophysiology, offering a solid framework for future enhancements in their knowledge and reasoning abilities.

Journal reference: Scientific Reports, https://www.nature.com/articles/s41598-024-60405-y (DOI: 10.1038/s41598-024-60405-y).

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

