Medprompt: Elevating GPT-4 Performance in Medicine and Beyond through Intelligent Prompting Strategies

In an article recently submitted to the arXiv* preprint server, researchers conducted a study examining Generative Pre-trained Transformer-4's (GPT-4) capabilities within specialized domains, focusing chiefly on its performance in medicine.

Study: Medprompt: Elevating GPT-4 Performance in Medicine and Beyond through Intelligent Prompting Strategies. Image credit: NMStudio789/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Challenging the prevailing notion that specialist ability requires extensive model training on domain-specific knowledge, the study instead engineered prompts, culminating in "Medprompt." This composite prompting strategy markedly boosted GPT-4's performance, surpassing specialist models such as Med-PaLM 2 (Google's medically tuned Pathways Language Model) across nine medical benchmark datasets, while also demonstrating versatility in fields beyond medicine and significantly broadening its applicability.

Background

Initially, smaller models such as PubMed Bidirectional Encoder Representations from Transformers (PubMedBERT) and Biological Generative Pre-trained Transformer (BioGPT), pre-trained on domain-specific data, performed strongly on biomedical tasks. In contrast, larger generalist models such as GPT-3.5 and GPT-4 demonstrated impressive performance on medical challenges without domain-specific training. Subsequent studies showed that simple prompting techniques can steer these generalist models to excel in specialized domains, surpassing specialist models such as Med-PaLM 2 without extensive fine-tuning.

Medprompt: Techniques and Adaptability Overview

The Medprompt approach combines three essential techniques: Dynamic Few-shot selection, Self-Generated Chain of Thought, and Choice Shuffling Ensemble. The Dynamic Few-shot technique treats the task's training set as a high-quality source of few-shot examples, selecting different examples for different task inputs. Unlike fixed few-shot examples, this method dynamically identifies semantically similar examples using k-nearest neighbor (k-NN) retrieval over an embedding space, enhancing adaptability without extensive fine-tuning or billion-parameter updates.
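
As an illustration of how such dynamic selection might work, the sketch below assumes a generic embed() helper (standing in for whatever sentence-embedding model is used; the paper's specific choice is not reproduced here) and uses scikit-learn's NearestNeighbors to retrieve the training questions closest to a given test question. This is a minimal sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed(texts):
    # Placeholder embedding: swap in a real sentence-embedding model.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def select_dynamic_few_shot(train_questions, test_question, k=5):
    """Return the k training questions most semantically similar to the test question."""
    train_vecs = embed(train_questions)
    test_vec = embed([test_question])
    knn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(train_vecs)
    _, idx = knn.kneighbors(test_vec)
    return [train_questions[i] for i in idx[0]]
```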

The Self-Generated Chain of Thought method involves the generation of step-by-step reasoning sequences by GPT-4 for given question-answer pairs, similar to the process undertaken by human experts. However, GPT-4 autonomously generates detailed explanations through a template-based prompting mechanism instead of relying on manual crafting, demonstrating the model's capacity to produce intricate reasoning logic. A verification step is employed to compare the model's generated answer with the ground truth label, ensuring the reliability of the generated rationale and mitigating potential inaccuracies in reasoning chains.
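
A rough sketch of this verification step follows. The ask_model_cot callable is a hypothetical helper (not the authors' API) that returns the model's step-by-step reasoning together with its final answer letter; an exemplar is kept only when that answer matches the ground-truth label.

```python
def generate_verified_cot(question, choices, ground_truth, ask_model_cot):
    """Ask the model for step-by-step reasoning and keep the exemplar
    only if its final answer matches the known label."""
    prompt = (
        "Answer the question by reasoning step by step, "
        "then give the final answer letter.\n"
        f"Question: {question}\nChoices: {choices}\n"
    )
    reasoning, answer = ask_model_cot(prompt)  # hypothetical helper
    if answer.strip().upper() == ground_truth.strip().upper():
        return {"question": question, "cot": reasoning, "answer": answer}
    return None  # discard chains whose conclusion disagrees with the label
```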

The Choice Shuffling Ensemble technique aims to address position biases in multiple-choice answers exhibited by GPT-4. This method reduces biases and enhances diversity in reasoning paths by shuffling the order of answer choices and checking the consistency of generated answers across different sort orders. This technique contributes to improved ensemble quality and diminishes sensitivity to choice order, thereby refining the model's robustness.
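
The choice-shuffling idea can be sketched as follows; ask_model is again a hypothetical helper that returns the letter the model picks, and votes are tallied over the option contents so that reshuffling does not confuse the count.

```python
import random
from collections import Counter

def choice_shuffle_ensemble(question, choices, ask_model, n_runs=5, seed=0):
    """Shuffle answer choices n_runs times, query the model each time,
    and return the option text chosen most often (majority vote)."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_runs):
        shuffled = choices[:]
        rng.shuffle(shuffled)
        labels = [f"{chr(65 + i)}. {c}" for i, c in enumerate(shuffled)]
        prompt = f"Question: {question}\n" + "\n".join(labels) + "\nAnswer:"
        picked_letter = ask_model(prompt)          # hypothetical helper
        picked_index = ord(picked_letter.strip().upper()[0]) - 65
        votes.append(shuffled[picked_index])       # vote by option content
    return Counter(votes).most_common(1)[0][0]
```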

The culmination of these techniques, termed Medprompt, combines intelligent few-shot exemplar selection, self-generated chain-of-thought reasoning, and majority-vote ensembling. Medprompt integrates dynamic adaptation, automated reasoning, and ensemble-based decision-making, achieving high accuracy on medical benchmark datasets. Although initially designed for medical multiple-choice question answering, Medprompt's versatility suggests broader applications across problem-solving tasks beyond the medical domain.

The configuration used for Medprompt includes five k-NN-selected few-shot exemplars and five items in the choice-shuffle ensemble, striking a balance between accuracy and computational cost; a combined pipeline under this configuration is sketched below. Ablation studies indicate that further gains may be possible with larger hyperparameter values. While Medprompt excels on medical benchmarks, its general-purpose nature implies applicability to diverse domains and problem-solving scenarios.
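
Putting the pieces together, a high-level driver under the reported configuration (five k-NN exemplars, five choice-shuffled runs) might look like the following. It reuses the illustrative helpers sketched above and is an assumption-laden outline rather than the authors' code.

```python
def medprompt_answer(test_question, choices, train_set, ask_model,
                     k_exemplars=5, ensemble_size=5):
    """Illustrative Medprompt-style pipeline: dynamic few-shot selection,
    verified self-generated CoT exemplars, and choice-shuffle ensembling."""
    # 1. Retrieve the k training questions most similar to the test question.
    neighbors = select_dynamic_few_shot(
        [ex["question"] for ex in train_set], test_question, k=k_exemplars)
    # 2. Build the few-shot context from pre-verified CoT exemplars.
    exemplars = [ex for ex in train_set if ex["question"] in neighbors]
    context = "\n\n".join(
        f"Q: {ex['question']}\nReasoning: {ex['cot']}\nAnswer: {ex['answer']}"
        for ex in exemplars)
    # 3. Answer with a choice-shuffled majority vote over the ensemble.
    prompted_question = context + "\n\nQ: " + test_question
    return choice_shuffle_ensemble(prompted_question, choices,
                                   ask_model, n_runs=ensemble_size)
```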

This framework's adaptability and success in achieving record-breaking performance in medical question answering signify its potential for broader applications, transcending the medical domain to encompass various problem-solving tasks and domains. Detailed analyses in subsequent sections shed light on its extensibility and effectiveness in less constrained problem-solving scenarios, further underscoring its versatility and robustness.

Medprompt: Versatility and Superiority Unveiled

Performance Evaluation: Various foundation models were evaluated on the multiple-choice components of the MultiMedQA benchmark suite. Notably, GPT-4 with Medprompt outperforms all other models on every benchmark, achieving state-of-the-art results. The Medprompt strategy reaches an accuracy of 90.2% across nine diverse benchmark datasets, surpassing Flan-PaLM 540B and Med-PaLM 2, both fine-tuned on subsets of these benchmarks.

Evaluation on Eyes-Off Data: To gauge overfitting risk, Medprompt was also evaluated on an "eyes-off" subset of each benchmark dataset that was held out during method development. GPT-4 with Medprompt achieves an average accuracy of 90.6% on "eyes-on" data and 91.3% on "eyes-off" data, indicating minimal overfitting across the MultiMedQA datasets. Moreover, its superior performance on the eyes-off data in 6 out of 9 benchmarks underscores Medprompt's robustness.

Insights from Ablation Studies: The ablation study dissects Medprompt's components, revealing their relative contributions. Chain-of-thought reasoning steps exhibit the most significant impact (+3.4%), followed by dynamic few-shot exemplars and choice shuffling ensembling (+2.2% each), enhancing Medprompt's performance on the MedQA dataset.

Expert vs. GPT-4 CoT Comparison: The researchers compared expert-crafted chain-of-thought (CoT) prompts from Med-PaLM 2 with GPT-4's self-generated CoT prompts. The self-generated CoT outperforms the expert-crafted version by 3.1 absolute percentage points on the MedQA dataset and exhibits finer-grained reasoning logic, suggesting that prompts generated by the model itself can play to its strengths better than those crafted for another model.

Generalization Across Domains: Medprompt's adaptability extends beyond medical question answering, as evidenced by its performance on diverse datasets across various subjects. It consistently outperforms zero-shot baselines, demonstrating its applicability across diverse problem-solving tasks.

Conclusion

To sum up, the study examined the efficacy of prompting strategies for enhancing GPT-4's performance in medical problem-solving without extensive fine-tuning or expert-crafted prompts. The researchers introduced Medprompt, a composite prompting approach that significantly improves GPT-4's accuracy across various medical question-answering datasets, surpassing specialist models. Ablation studies highlighted the contribution of each individual component within Medprompt.

Evaluations on datasets from diverse fields showcased Medprompt's adaptability beyond medicine. The authors envision further research to leverage Medprompt's capabilities across multiple disciplines and to explore its potential for generating powerful prompts for non-multiple-choice questions. Beyond prompting, they also note the continuing importance of fine-tuning and parametric updates in realizing the potential of foundation models, particularly in critical domains such as healthcare.



Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.


