In an article published in the journal npj Digital Medicine, researchers from the USA introduced an innovative approach using large language models (LLMs) to automatically identify six categories of social determinants of health (SDoH) from clinical narratives. The categories include employment, housing, transportation, parental status, relationship, and social support.
Background
LLMs are artificial intelligence (AI) systems that can process and generate natural language at a large scale. These models are trained on massive amounts of text data, such as web pages, books, and news articles, and learn to capture the patterns and meanings of language. They have shown remarkable performance in various natural language processing (NLP) tasks, such as text summarization, translation, and question-answering. However, applying LLMs to clinical NLP poses several challenges such as data scarcity, class imbalance, data privacy, and model bias. LLMs are still an emerging area of research, particularly in the context of the extraction of SDoH information from electronic health records (EHR).
SDoH refers to the set of social, economic, and environmental factors influencing an individual's health and well-being. This includes elements such as access to healthcare, socioeconomic status, education, employment, social support networks, and the physical environment. The concept highlights that health outcomes are shaped by factors beyond medical care and genetics, underscoring the importance of addressing social and economic inequalities to improve overall health and reduce disparities in populations. Unfortunately, these factors are typically recorded only in the free-text portions of the EHR rather than in structured fields. As a result, manual extraction is time-consuming, challenging, and labor-intensive, making it difficult to utilize this information for research and clinical care.
LLMs can address this challenge by automating the abstraction of SDoH from clinical texts. However, class imbalance and data limitations pose challenges, as this important information is rarely documented. While previous studies have shown the feasibility of NLP in extracting various SDoH categories, there is a need to optimize performance for high-risk medical domains. Additionally, it is important to evaluate the effectiveness of state-of-the-art LLMs on this specific task.
About the Research
In the present paper, the authors aimed to investigate the most effective methods for using LLMs to extract six SDoH from narrative text in the EHR. They utilized a dataset containing 800 clinic notes from 770 cancer patients at a hospital in Boston, Massachusetts. These patients underwent radiotherapy at the hospital between 2015 and 2022. The notes were manually annotated for the presence of six SDoH categories and their attributes, such as adverse or protective. The researchers also created two out-of-domain test datasets from patients with cancer treated with immunotherapy and patients in intensive care units.
The study proposed and evaluated several multilabel classifiers based on different LLM architectures, such as bidirectional encoder representations from transformers (BERT) and fine-tuned language net text-to-text transfer transformers (Flan-T5). It compared them with zero- and few-shot learning methods using chat generative pre-trained transformer (ChatGPT) family models, such as GPT-3.5 and GPT-4. Additionally, the research explored the role of synthetic data augmentation, using ChatGPT-family models to generate additional SDoH sentences for training. Moreover, the authors assessed the potential bias of LLMs in predicting SDoH labels when demographic information, such as race/ethnicity and gender, was added to the text. Finally, they compared the text-extracted SDoH information with the structured Z-codes entered in the EHR as a proxy for SDoH documentation.
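In the zero-shot setting, the model receives only a task description and the candidate SDoH categories, with no labeled examples. A minimal sketch of how such a prompt might be assembled is below; the exact wording, label formatting, and instruction phrasing are illustrative assumptions, not the authors' actual prompt.

```python
# Illustrative zero-shot prompt builder for multilabel SDoH classification.
# The prompt wording is an assumption, not the study's exact prompt.

SDOH_LABELS = [
    "employment", "housing", "transportation",
    "parental status", "relationship", "social support",
]

def build_zero_shot_prompt(sentence: str) -> str:
    """Format one clinical sentence into a zero-shot multilabel prompt."""
    labels = ", ".join(SDOH_LABELS)
    return (
        "You are reviewing a sentence from a clinical note.\n"
        f"Possible social determinants of health: {labels}.\n"
        "List every category mentioned in the sentence, "
        "or 'none' if no category applies.\n\n"
        f"Sentence: {sentence}\n"
        "Categories:"
    )

prompt = build_zero_shot_prompt("Patient lives alone and was recently evicted.")
```

The resulting string would be sent to the ChatGPT-family model, whose free-text answer is then parsed back into label sets; few-shot prompting differs only in prepending a handful of labeled example sentences.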
Research Findings
The outcomes showed that the best-performing models for the any-SDoH-mention task were Flan-T5 XL and Flan-T5 XXL with synthetic data augmentation, while the best model for the adverse-SDoH-mention task was Flan-T5 XL without synthetic data augmentation. These models achieved macro-F1 scores of 0.71 and 0.70, respectively.
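Macro-F1 averages the per-category F1 scores with equal weight, so rare SDoH categories count as much as common ones, which matters given the class imbalance noted earlier. A minimal sketch of the metric on toy multilabel predictions (the label sets below are illustrative, not data from the study):

```python
# Macro-F1 for multilabel predictions: per-label F1, averaged with equal weight.

def macro_f1(y_true, y_pred, n_labels):
    """y_true, y_pred: lists of label-index sets, one set per example."""
    f1_scores = []
    for k in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if k in t and k in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if k not in t and k in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if k in t and k not in p)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / n_labels

y_true = [{0}, {1}, {0, 1}, set()]   # gold label sets (toy data)
y_pred = [{0}, {0}, {0, 1}, {1}]     # predicted label sets
score = macro_f1(y_true, y_pred, n_labels=2)  # (0.8 + 0.5) / 2 = 0.65
```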
Flan-T5 models consistently outperformed BERT, with performance scaling with model size. Synthetic data augmentation proved most beneficial for classes with few instances in the training data, particularly those with low performance when trained on gold data alone. Ablation studies indicated that when synthetic data were included in training, only about half of the gold-labeled dataset was needed to maintain performance.
Additionally, the study found that fine-tuned models surpassed ChatGPT-family models in zero- and few-shot learning for most SDoH classes. Fine-tuned models were also less sensitive to the injection of demographic descriptors. Notably, on the any-SDoH task, ChatGPT-family models were more likely to change their classification when a female gender descriptor was injected than when a male descriptor was.
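This bias probe can be thought of as a perturbation test: inject a demographic descriptor into the text and measure how often the predicted label set changes. A minimal sketch using a stand-in keyword classifier (the injection scheme and classifier are hypothetical; in the study the classifier would be the LLM itself):

```python
# Perturbation test for demographic sensitivity (illustrative sketch).

def inject_descriptor(sentence: str, descriptor: str) -> str:
    """Insert a demographic descriptor before 'patient' (hypothetical scheme)."""
    return sentence.replace("patient", f"{descriptor} patient", 1)

def flip_rate(sentences, descriptor, classify) -> float:
    """Fraction of sentences whose predicted label set changes after injection."""
    flips = sum(
        1 for s in sentences
        if classify(s) != classify(inject_descriptor(s, descriptor))
    )
    return flips / len(sentences)

# Stand-in classifier: an unbiased model should ignore the descriptor entirely.
def keyword_classify(sentence):
    labels = set()
    if "evicted" in sentence or "homeless" in sentence:
        labels.add("housing")
    if "unemployed" in sentence:
        labels.add("employment")
    return labels

notes = [
    "The patient was recently evicted.",
    "The patient is unemployed and lives with family.",
]
rate = flip_rate(notes, "female", keyword_classify)  # 0.0: descriptor-insensitive
```

A nonzero flip rate for one descriptor but not another (e.g., female vs. male) is the kind of asymmetry the study reported for ChatGPT-family models.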
Furthermore, the study highlighted that text-extracted SDoH information identified 93.8% of patients with adverse SDoH, while the corresponding structured International Classification of Diseases (ICD)-10 Z-codes captured only 2.0%.
Conclusion
In summary, this novel method of using LLMs to extract SDoH information from clinical text is efficient and scalable. It can improve the real-world evidence on SDoH and assist in identifying patients who could benefit from support resources. The paper highlighted the role of synthetic data augmentation, explored the performance of zero- and few-shot learning with ChatGPT-family models, and investigated the potential bias of LLM predictions across patient populations.
The findings demonstrated that fine-tuned Flan-T5 models exhibit higher robustness and accuracy than both zero- and few-shot ChatGPT-family approaches and fine-tuned BERT models. Moreover, the study illustrated the ability of synthetic data augmentation to enhance performance.
The researchers acknowledged the challenges and limitations of using LLMs in the medical domain, such as data scarcity, class imbalance, and algorithmic bias. They suggested future directions for improving the quality and diversity of synthetic data, optimizing the prompting methods for ChatGPT-family models, and assessing the generalizability and robustness of LLMs across different patient populations and clinical settings.