CRAFT-MD Framework Redefines AI’s Clinical Readiness

Despite acing tests, AI tools stumble in realistic medical interactions—CRAFT-MD unveils their diagnostic challenges and paves the way for smarter, real-world-ready healthcare solutions.

Research: An evaluation framework for clinical use of large language models in patient interaction tasks. Image Credit: LALAKA / Shutterstock

Artificial intelligence tools such as ChatGPT have been touted for their promise to alleviate clinician workload by triaging patients, taking medical histories, and even providing preliminary diagnoses.

These tools, known as large language models, are already helping patients understand their symptoms and medical test results.

But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?

Not that great, according to the findings of a new study led by researchers at Harvard Medical School and Stanford University.

For their analysis, published on Jan. 2 in the journal Nature Medicine, the researchers designed an evaluation framework, or test, called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine). They deployed it on four large language models (LLMs), GPT-4, GPT-3.5, Mistral, and LLaMA-2-7b, to see how well they performed in settings closely mimicking actual patient interactions.

All four large language models performed well on medical exam-style questions, but their performance worsened when they engaged in conversations that more closely resembled real-world interactions. For instance, GPT-4's diagnostic accuracy dropped from 0.820 to 0.627 in multi-turn conversations. GPT-3.5 and Mistral showed similarly significant declines, highlighting the challenges these models face in dynamic clinical scenarios.

The researchers said this gap underscores a twofold need: first, more realistic evaluations that better gauge the fitness of clinical AI models for use in the real world; second, improving these tools’ capabilities to integrate scattered patient information and engage in nuanced reasoning.

Evaluation tools like CRAFT-MD, the research team said, can not only assess AI models more accurately for real-world fitness but could also help optimize their performance in the clinic. The framework processes 10,000 multi-turn conversations in just 48 to 72 hours, compared to 1,150 hours required for human-based evaluation methods.

"Our work reveals a striking paradox—while these AI models excel at medical board exams, they struggle with the basic back-and-forth of a doctor's visit," said study senior author Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School. "The dynamic nature of medical conversations—the need to ask the right questions at the right time, to piece together scattered information, and to reason through symptoms—poses unique challenges that go far beyond answering multiple-choice questions. When we switch from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy."

A Better Test to Check AI's Real-World Performance

Right now, developers test the performance of AI models by asking them to answer multiple-choice medical questions, typically derived from the national exam for graduating medical students or from tests given to medical residents as part of their certification.
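To make the contrast concrete, the snippet below sketches what this kind of exam-style evaluation typically looks like: the model is handed a fully summarized case plus answer options and only has to pick a letter. This is a minimal illustration, not code from the study; the client, model name, prompt wording, and helper function are all assumptions.

```python
# Minimal sketch of exam-style (multiple-choice) evaluation: the model sees a
# complete, pre-summarized vignette and a list of options, and is scored on
# whether it picks the right letter. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def grade_multiple_choice(vignette: str, options: dict[str, str], answer_key: str) -> bool:
    """Ask for a single-letter answer and compare it with the answer key."""
    option_text = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = (
        "You are taking a medical licensing exam.\n"
        f"Case: {vignette}\n"
        f"Options:\n{option_text}\n"
        "Respond with the single letter of the best diagnosis."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    predicted = response.choices[0].message.content.strip()[:1].upper()
    return predicted == answer_key
```

Everything the model needs is packed into a single, tidy prompt, which is precisely the assumption the researchers argue breaks down in practice.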

"This approach assumes that all relevant information is presented clearly and concisely, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is far messier," said study co-first author Shreya Johri, a doctoral student in the Rajpurkar Lab at Harvard Medical School. "We need a testing framework that reflects reality better and is, therefore, better at predicting how well a model would perform."

CRAFT-MD was designed to be one such realistic gauge.

To simulate real-world interactions, CRAFT-MD evaluates how well large language models can collect information about symptoms, medications, and family history and then make a diagnosis. One AI agent poses as a patient, answering questions in a conversational, natural style. Another AI agent grades the accuracy of the final diagnosis rendered by the large language model. Medical experts also review a subset of the outcomes to provide qualitative insights and ensure evaluation reliability.
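In outline, that protocol is a loop between three roles: the clinical model under test, a patient-AI that answers only what it is asked, and a grader-AI that scores the final diagnosis. The sketch below illustrates that control flow under stated assumptions; the prompts, model names, turn limit, and grading heuristic are hypothetical and are not CRAFT-MD's actual implementation.

```python
# Simplified sketch of a conversational evaluation loop in the spirit of
# CRAFT-MD: a patient-AI answers the clinical model's questions one turn at a
# time, and a grader-AI checks the final diagnosis against the ground truth.
# All prompts, model names, and the turn limit are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

def ask(role_prompt: str, history: list[dict]) -> str:
    """Send a system role prompt plus the running conversation; return the reply."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "system", "content": role_prompt}, *history],
    )
    return response.choices[0].message.content.strip()

def run_conversational_case(vignette: str, true_diagnosis: str, max_turns: int = 10) -> bool:
    """Run one simulated doctor-patient conversation and grade the final diagnosis."""
    doctor_prompt = (
        "You are a clinician interviewing a patient. Ask one question at a time. "
        "When you are confident, reply with 'FINAL DIAGNOSIS: <your diagnosis>'."
    )
    patient_prompt = (
        f"You are a patient with this history: {vignette}. "
        "Answer in a conversational, natural style and only reveal what you are asked about."
    )
    doctor_view: list[dict] = []   # the conversation as the model under test sees it
    patient_view: list[dict] = []  # the same exchange from the patient-AI's side

    for _ in range(max_turns):
        question = ask(doctor_prompt, doctor_view)
        doctor_view.append({"role": "assistant", "content": question})

        if question.upper().startswith("FINAL DIAGNOSIS:"):
            diagnosis = question.split(":", 1)[1].strip()
            grader_prompt = (
                "You are a grader. Reply 'yes' or 'no': does the predicted diagnosis "
                f"'{diagnosis}' match the ground-truth diagnosis '{true_diagnosis}'?"
            )
            return ask(grader_prompt, []).lower().startswith("yes")

        patient_view.append({"role": "user", "content": question})
        answer = ask(patient_prompt, patient_view)
        patient_view.append({"role": "assistant", "content": answer})
        doctor_view.append({"role": "user", "content": answer})

    return False  # no final diagnosis reached within the turn limit
```

In the published framework, the patient-AI and grader-AI are themselves large language models and medical experts review a subset of the graded conversations; the point of the loop above is simply to show how much more the model must do, asking the right questions and assembling scattered answers, than in the single-shot, exam-style setting.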

The researchers used CRAFT-MD to test the four AI models, both proprietary (commercial) and open-source, on 2,000 clinical vignettes featuring conditions common in primary care and across 12 medical specialties, including dermatology, cardiology, and neurology.

All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information given by patients. That, in turn, compromised their ability to take medical histories and render appropriate diagnoses. GPT-4's diagnostic accuracy, for example, decreased further when information arrived over multiple conversational turns rather than in a static case summary, illustrating the difficulty of synthesizing information scattered across an exchange. Across the board, the models performed worse in back-and-forth exchanges, the form most real-world conversations take, than when given summarized versions of the same conversations.

Recommendations for Optimizing AI's Real-World Performance

Based on these findings, the team offers recommendations both for developers who design AI models and for regulators who evaluate and approve these tools.

These include:

  • Using conversational, open-ended questions that more accurately mirror unstructured doctor-patient interactions in the design, training, and testing of AI tools.
  • Assessing models for their ability to ask the right questions and to extract the most essential information.
  • Designing models capable of following multiple conversations and integrating information from them.
  • Incorporating multimodal data, such as medical images, into AI models to improve diagnostic accuracy.
  • Developing AI agents that can interpret non-verbal cues, such as tone, facial expressions, and body language.

Additionally, the researchers recommend that evaluations combine AI agents with human experts, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert review. A fully human-based approach, by contrast, would require an estimated 1,150 hours for comparable tasks. Using AI evaluators as the first line also eliminates the risk of exposing real patients to unverified AI tools.

The researchers expect CRAFT-MD to be periodically updated and optimized to integrate improved patient-AI models. This adaptability ensures the framework remains relevant as AI technologies evolve.

"As a physician-scientist, I am interested in AI models that can augment clinical practice effectively and ethically," said study co-senior author Roxana Daneshjou, assistant professor of Biomedical Data Science and Dermatology at Stanford University. "CRAFT-MD creates a framework that more closely mirrors real-world interactions and thus helps move the field forward when it comes to testing AI model performance in health care."

Journal reference:
  • Johri, S., Jeong, J., Tran, B. A., Schlessinger, D. I., Wongvibulsin, S., Barnes, L. A., Zhou, H., Cai, Z. R., Van Allen, E. M., Kim, D., Daneshjou, R., & Rajpurkar, P. (2025). An evaluation framework for clinical use of large language models in patient interaction tasks. Nature Medicine, 1-10. DOI: 10.1038/s41591-024-03328-5, https://www.nature.com/articles/s41591-024-03328-5
