In an article published in the journal Nature Communications, researchers evaluated the clinical accuracy of generative pre-trained transformer (GPT)-3.5 and GPT-4, as well as two configurations of the open-source Llama 2 large language models (LLMs), in providing initial diagnoses, examination steps, and treatment suggestions for various medical cases.
GPT-4 outperformed GPT-3.5 and Google search on diagnosis tasks, with generally better performance on common diseases than on rare ones. While the models showed promise, their weaknesses highlighted the need for robust, regulated artificial intelligence (AI) models in healthcare.
Background
The rise of LLMs, particularly exemplified by OpenAI's ChatGPT with versions like GPT-3.5 and GPT-4, has revolutionized various text-based tasks, including text summarization, code generation, and personal assistance. However, concerns have been raised about the accuracy and reliability of these models, especially in critical fields like medicine where misinformation can have severe consequences. While preliminary studies have showcased potential applications of ChatGPT in medical contexts, comprehensive evaluations of their diagnostic and therapeutic capabilities are lacking.
Existing research has primarily focused on simulating medical exams or assisting with medical writing, leaving a gap in assessing their performance in clinical decision-making tasks such as initial diagnosis, examination recommendations, and treatment suggestions across various diseases. This paper aimed to address this gap by conducting a thorough analysis of the clinical accuracy of GPT-3.5 and GPT-4 in handling these tasks, considering the frequency of diseases to account for varying difficulty levels. Additionally, it explored the potential of open-source LLMs like Llama 2 as an alternative.
Methods
The researchers aimed to evaluate the clinical accuracy of LLMs, specifically GPT-3.5 and GPT-4, in performing diagnostic, examination, and treatment tasks across a diverse range of medical cases. First, a comprehensive selection process was undertaken to ensure a representative sample of realistic cases from German clinical casebooks. Cases were categorized by disease frequency into rare, less frequent, and frequent diseases.
To generate patient queries, the cases were translated into layman's language and presented to the LLMs and the Google search engine. Two independent physicians rated the resulting outputs on a five-point Likert scale. Additionally, an exploratory analysis was conducted on open-source LLMs, specifically Llama 2, in two different model sizes.
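The study's exact prompt wording is not reproduced in this summary. As a minimal sketch, assuming a layman-language query derived from a casebook case, posing it to a GPT model through the OpenAI Python client could look like this (the query text and model choice are illustrative, not taken from the paper):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical layman-language patient query derived from a clinical case.
query = (
    "I have had a high fever for three days, a stiff neck, and bright light "
    "hurts my eyes. What could this be, which examinations would be needed, "
    "and how would it be treated?"
)

# Ask for an initial diagnosis, examination steps, and treatment suggestions,
# mirroring the three tasks evaluated in the study.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": query}],
)
print(response.choices[0].message.content)
```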
The analysis was statistically powered for comparisons between the LLMs and Google and among the LLMs themselves, while the Llama 2 models were analyzed descriptively. Performance was evaluated as the cumulative score across all three tasks for each LLM, stratified by disease-frequency subgroup. This comprehensive approach provided insight into the capabilities of LLMs in clinical decision-making tasks and their potential applications in healthcare.
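The paper's scoring pipeline is not detailed in this summary, but a cumulative, subgroup-stratified Likert score can be illustrated with a short pandas sketch; all column names and ratings below are invented for illustration:

```python
import pandas as pd

# Invented per-case ratings: two physicians score each model's output for
# each task (diagnosis, examination, treatment) on a 1-5 Likert scale.
ratings = pd.DataFrame({
    "model":     ["GPT-4", "GPT-4", "GPT-3.5", "GPT-3.5"],
    "frequency": ["frequent", "rare", "frequent", "rare"],
    "task":      ["diagnosis", "treatment", "diagnosis", "treatment"],
    "rater_1":   [5, 3, 4, 2],
    "rater_2":   [5, 4, 4, 2],
})

# Average the two raters per case, then sum across tasks to obtain a
# cumulative score per model within each disease-frequency subgroup.
ratings["mean_score"] = ratings[["rater_1", "rater_2"]].mean(axis=1)
cumulative = (
    ratings.groupby(["model", "frequency"])["mean_score"]
           .sum()
           .unstack("frequency")
)
print(cumulative)
```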
Results
The study assessed the clinical accuracy of two successive LLMs, GPT-3.5 and GPT-4, in diagnosing, examining, and treating medical cases, and compared them with Google search results. Inter-rater reliability analysis revealed substantial to almost perfect agreement between raters for all tasks. GPT-4 outperformed both GPT-3.5 and Google in diagnosis, with statistically significant differences.
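"Substantial to almost perfect" agreement matches the Landis and Koch bands conventionally applied to Cohen's kappa. The summary does not name the exact statistic used, but a weighted kappa computed on two raters' ordinal Likert scores, as sketched below with invented ratings, is one standard way to obtain such a figure:

```python
from sklearn.metrics import cohen_kappa_score

# Invented five-point Likert ratings from two physicians for the same outputs.
rater_1 = [5, 4, 4, 2, 5, 3, 4, 1]
rater_2 = [5, 4, 3, 2, 5, 3, 4, 2]

# Quadratic weighting penalizes large disagreements more, which suits ordinal
# scales; kappa of 0.61-0.80 is "substantial" and 0.81-1.00 "almost perfect"
# under the Landis and Koch interpretation.
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```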
Notably, all tools performed better on frequent diseases than on rare ones. For examination steps, GPT-4 showed superior performance over GPT-3.5, especially for rare diseases. For treatment suggestions, GPT-4 performed slightly better than GPT-3.5, although the difference was not statistically significant. These findings suggested the potential of commercial LLMs like GPT-4 to assist clinical decision-making, particularly in diagnosing medical cases.
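The summary does not name the significance test applied. For paired, ordinal per-case scores such as these, a nonparametric test like the Wilcoxon signed-rank test is a common choice; the sketch below uses invented scores purely to illustrate the comparison:

```python
from scipy.stats import wilcoxon

# Invented per-case treatment scores for the same ten cases under each model.
gpt4_scores  = [5, 4, 4, 3, 5, 4, 3, 4, 5, 4]
gpt35_scores = [4, 4, 3, 3, 4, 3, 3, 4, 4, 2]

# Paired nonparametric test of whether the per-case score differences are
# systematically shifted away from zero; zero differences are dropped.
stat, p_value = wilcoxon(gpt4_scores, gpt35_scores)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.3f}")
```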
Yet further improvements were needed, especially for rare diseases. Moreover, the comparison with open-source models highlighted the ongoing advancement of LLM technology and the need for continued evaluation to ensure reliability and effectiveness in clinical settings.
Discussion
The researchers comprehensively evaluated GPT-3.5 and GPT-4, along with Google search, on clinical decision-support tasks across various disease frequencies. GPT-4 improved significantly over GPT-3.5, outperforming both GPT-3.5 and Google in diagnosis, examination, and treatment recommendations. However, challenges persisted, particularly in diagnosing rare diseases and in refining prompts to elicit accurate responses.
While open-source models like Llama 2 showed promise, they lagged slightly behind their commercial counterparts. The study underscored the evolving role of LLMs in healthcare decision-making, emphasizing the need for continual improvement in accuracy, transparency, and regulatory compliance. Despite these advancements, caution was warranted, as the LLMs still fell short of the consistently high accuracy required for standalone medical consultation. Future integration of LLMs into healthcare will require adherence to rigorous regulatory standards and the exploration of open-source alternatives for greater transparency and oversight.
Conclusion
In conclusion, the researchers underscored the potential of advanced LLMs like GPT-4 in clinical decision support, particularly for diagnosing common diseases. While improvements over previous models were evident, challenges remained, especially in diagnosing rare conditions.
Additionally, open-source LLMs showed promise but required further refinement. The findings highlighted the evolving landscape of AI in healthcare and emphasized the need for ongoing evaluation, regulatory compliance, and transparency to ensure safe and effective integration into clinical practice.
Journal reference:
- Sandmann, S., Riepenhausen, S., Plagwitz, L., & Varghese, J. (2024). Systematic analysis of ChatGPT, Google search, and Llama 2 for clinical decision support tasks. Nature Communications, 15(1), 2050. https://doi.org/10.1038/s41467-024-46411-8