Language models (LMs) have gained significant attention in recent years for their ability to generate human-like text responses. In healthcare, LMs hold immense potential to enhance patient communication, improve medical research, and aid in various aspects of healthcare delivery. However, it is crucial to understand the limitations and challenges associated with these models to ensure their responsible and effective utilization.
An article posted to the medRxiv* preprint server presented a comparative analysis of ChatGPT and Bard for anesthesia-related queries and discussed the broader limitations and future directions of LMs in healthcare.
The authors compared ChatGPT and Bard, two popular language models, to evaluate how well they answered questions about anesthesia from a patient's perspective. The researchers selected commonly asked anesthesia-related questions and provided zero-shot prompts to both models. The generated responses were then evaluated for readability, linguistic quality, hallucination errors, and sentiment using computational sentiment analysis.
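To make this kind of evaluation concrete, the sketch below scores a single model response for readability and sentiment, the two automated measures reported in the study. It is a minimal illustration assuming the open-source textstat and TextBlob packages and a made-up example response; the preprint does not specify which tools the authors actually used.

```python
# Minimal sketch: scoring a model response for readability and sentiment.
# Assumes the textstat and TextBlob packages (pip install textstat textblob);
# the preprint does not state which tools the authors used, and the response
# text below is an illustrative placeholder.
import textstat
from textblob import TextBlob

response = (
    "General anesthesia is a medically induced state of unconsciousness "
    "that allows you to undergo surgery without pain or awareness."
)

# Readability: a higher grade level means harder to read (college is roughly grade 13+).
grade_level = textstat.flesch_kincaid_grade(response)
reading_ease = textstat.flesch_reading_ease(response)

# Computational sentiment analysis: polarity in [-1, 1], subjectivity in [0, 1].
sentiment = TextBlob(response).sentiment

print(f"Flesch-Kincaid grade level: {grade_level:.1f}")
print(f"Flesch reading ease:        {reading_ease:.1f}")
print(f"Polarity:                   {sentiment.polarity:.2f}")
print(f"Subjectivity:               {sentiment.subjectivity:.2f}")
```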
*Important notice: medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Comparing ChatGPT and Bard for anesthesia-related questions
The results of the study revealed notable differences between ChatGPT and Bard. ChatGPT provided longer, more intellectual, and more practical responses than Bard. It did not exhibit any "hallucination" errors, indicating that its responses were factually accurate, whereas Bard had a 30.3% error rate in its responses. However, ChatGPT's responses were harder to read, scoring at a college reading level, while Bard's responses had a more informal, colloquial tone equivalent to an eighth-grade reading level. Despite the readability difference, ChatGPT demonstrated significantly better linguistic quality than Bard. Computational sentiment analysis showed that Bard had a higher polarity score, indicating a more positive sentiment in its responses, while the subjectivity scores of the two language models were similar.
The study's findings highlight the potential of LMs to provide accurate and informative responses to anesthesia-related queries. ChatGPT performed better overall, with more intellectual and practical responses. However, the greater reading difficulty of ChatGPT's responses may pose challenges for patients who prefer a more conversational tone like Bard's. The study suggests that further efforts to incorporate health literacy are needed to enhance patient-clinician communication and improve post-operative patient outcomes.
Limitations of language models in healthcare
The present study also identified limitations associated with LMs and proposed potential future directions for improvement. One significant limitation observed was the variability in responses across different users: inconsistencies and incomplete or inappropriate responses were prevalent in the large language model (LLM) outputs. The study attempted to mitigate this issue by repeating the questions three times, but no significant difference was observed in the overall text outputs. Future investigations should explore the optimal number of query iterations, as different studies have used varying numbers without a definitive, objective measure for all scenarios.
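As a rough illustration of the repetition check described above, the sketch below sends the same question to a model several times and compares the outputs pairwise. Here, `query_model` is a hypothetical placeholder for whatever chat API is being evaluated, and the standard-library similarity ratio is a stand-in measure, not the method used in the preprint.

```python
# Sketch of a query-repetition consistency check. `query_model` is a
# hypothetical placeholder for the chat API under evaluation; the pairwise
# similarity uses Python's standard library and is not the study's method.
from difflib import SequenceMatcher
from itertools import combinations

def query_model(prompt: str) -> str:
    """Hypothetical call to a language model; replace with a real API client."""
    raise NotImplementedError("wire this up to the model being evaluated")

def consistency_check(prompt: str, n_repeats: int = 3) -> float:
    """Ask the same question n_repeats times and return the mean pairwise similarity."""
    responses = [query_model(prompt) for _ in range(n_repeats)]
    pair_scores = [
        SequenceMatcher(None, a, b).ratio()
        for a, b in combinations(responses, 2)
    ]
    return sum(pair_scores) / len(pair_scores)

# Example usage: a score near 1.0 suggests stable answers; lower values flag variability.
# similarity = consistency_check("What are the risks of general anesthesia?", n_repeats=3)
```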
Another vital aspect highlighted by the study is the influence of initial prompts on LM performance and accuracy. Incorporating chain-of-thought prompting, in which the model reasons step by step before answering, could be a valuable way to evaluate how query complexity affects performance compared with zero-shot or few-shot approaches. The study also proposed evaluating the models' text outputs with additional performance metrics such as ROUGE-L and METEOR.
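For readers unfamiliar with these metrics, the sketch below scores a candidate answer against a reference answer using ROUGE-L (via Google's rouge-score package) and METEOR (via NLTK). The example texts are placeholders; the preprint only proposes these metrics and does not report results with them.

```python
# Sketch: scoring a model answer against a reference with ROUGE-L and METEOR.
# Assumes `pip install rouge-score nltk` and the NLTK WordNet corpus
# (nltk.download("wordnet")); the reference and candidate texts are placeholders.
from rouge_score import rouge_scorer
from nltk.translate.meteor_score import meteor_score

reference = "Fasting before anesthesia reduces the risk of aspirating stomach contents."
candidate = "You should fast before anesthesia to lower the risk of aspiration."

# ROUGE-L measures longest-common-subsequence overlap between the two texts.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"]

# METEOR in recent NLTK versions expects pre-tokenized input; a simple
# whitespace split is used here for brevity.
meteor = meteor_score([reference.split()], candidate.split())

print(f"ROUGE-L F1: {rouge_l.fmeasure:.3f}")
print(f"METEOR:     {meteor:.3f}")
```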
Concerns regarding LMs
One well-known problem associated with LMs is the occurrence of "hallucinations," in which a model generates responses lacking factual accuracy; although ChatGPT avoided such errors in this study, Bard did not. This phenomenon poses a risk of providing fabricated evidence, which is unacceptable when communicating facts to patients. Consequently, developing LMs as healthcare tools requires additional scrutiny and comprehension to ensure their reliability and trustworthiness in mainstream healthcare practices.
Conclusion
Despite their limitations, LMs such as ChatGPT and Bard have demonstrated their potential to generate effective responses to patient queries in anesthesia. While ChatGPT excels in technical and descriptive communication, Bard offers a more conversational approach. The optimal utilization of LMs in patient-centric scenarios could involve creating textual content to facilitate efficient patient communication before surgery, summarizing radiological reports, and improving peri-anesthesia patient care. However, it is crucial to acknowledge that LMs cannot fully "understand" queries like humans, nor can they distinguish accurate information from misinformation.
Ultimately, the integration of LMs into healthcare practices should be seen as a complement to human expertise rather than a replacement. The creativity and ethical judgment inherent in clinicians cannot be substituted by technology. Moving forward, it is imperative to address the identified limitations, conduct further research, and strive for responsible integration of LMs in healthcare to enhance patient care and advance medical research in anesthesia.