In an article recently submitted to the arXiv* preprint server, researchers proposed combining large language models (LLMs) with vision encoders to develop a general-purpose X-ray artificial intelligence (AI) model. Traditionally, AI systems in medical imaging have been tailored to specific tasks and have struggled to adapt to new challenges.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The new approach involves training multimodal models on routinely acquired medical images and their corresponding text reports. This methodology not only enables a diverse array of tasks but also generates rich and expressive results. The approach could usher in a new era of medical AI applications, facilitating tasks such as high-performance zero-shot and data-efficient classification, semantic search, visual question answering (VQA), and radiology report quality assurance (QA).
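To make the zero-shot idea concrete: once images and reports share an embedding space, a finding can be classified by comparing an image's embedding against embeddings of candidate text prompts, with no task-specific training. The Python sketch below is purely illustrative and is not the authors' code; `embed_image` and `embed_text` are hypothetical placeholders for the aligned encoders.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image_embedding: np.ndarray,
                       candidate_prompts: list[str],
                       embed_text) -> str:
    """Score each candidate text prompt against the image embedding
    and return the best-matching one -- no task-specific training."""
    scores = {prompt: cosine_similarity(image_embedding, embed_text(prompt))
              for prompt in candidate_prompts}
    return max(scores, key=scores.get)

# Hypothetical usage, with embeddings from the aligned encoders:
# finding = zero_shot_classify(embed_image(cxr),
#                              ["no pleural effusion", "pleural effusion"],
#                              embed_text)
```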
Related works
Recent years have witnessed remarkable progress in applying AI to medical imaging, with deep learning systems achieving expert-level performance across diverse medical tasks. However, challenges related to clinical and technical limitations have hindered the widespread deployment and impact of AI in real-world healthcare applications. These challenges include the cost-intensive curation of high-quality training datasets, the confinement of AI to specific tasks, difficulties in handling multimodal data, and the limited interpretability that obstructs effective collaboration between humans and AI.
Traditionally, AI models in medical imaging predominantly relied on vision-only approaches, such as convolutional neural networks (CNNs) and vision transformers. Nevertheless, training these models using conventional supervised methods has proven time-consuming and data-intensive. Furthermore, these models often remain confined to discrete tasks, like image classification and object detection, typically working with images from a single modality. However, clinical workflows often involve a spectrum of inputs, including clinical notes, images, and investigations, when making critical diagnoses and treatment decisions.
In light of these challenges, the fusion of LLMs and vision models presents an intriguing avenue for addressing the limitations of vision-only models in medical imaging, offering the potential for more comprehensive and efficient AI solutions.
Proposed methodology
The developed methodology, labeled "Embeddings for Language/Image-aligned X-Rays" (ELIXR), presents a comprehensive approach for merging language and image encoders. This involves grafting an aligned language-image encoder onto a pre-existing fixed LLM, the Pathways Language Model 2 (PaLM 2). This hybrid architecture, designed as a lightweight adapter, is configured to handle various tasks. The architecture was trained on a dataset consisting of images coupled with associated free-text radiology reports obtained from the Medical Information Mart for Intensive Care - Chest X-Ray (MIMIC-CXR) dataset.
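While the preprint should be consulted for architectural specifics, the general pattern of grafting a vision encoder onto a frozen LLM can be sketched as a small trainable adapter that maps image features into the LLM's token-embedding space. The PyTorch sketch below is a simplified assumption of that pattern, not the ELIXR implementation; all dimensions and module names are invented for illustration.

```python
import torch
import torch.nn as nn

class VisionToLLMAdapter(nn.Module):
    """Lightweight trainable bridge: projects frozen vision-encoder
    features into a frozen LLM's embedding space as a short sequence
    of 'soft' visual tokens. Dimensions here are illustrative."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096,
                 num_visual_tokens: int = 32):
        super().__init__()
        self.num_visual_tokens = num_visual_tokens
        self.llm_dim = llm_dim
        self.project = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, num_visual_tokens * llm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, vision_dim) pooled encoder output.
        tokens = self.project(image_features)
        # A token sequence the frozen LLM can consume, typically
        # prepended to the text-prompt embeddings during training.
        return tokens.view(-1, self.num_visual_tokens, self.llm_dim)

# Only the adapter's parameters receive gradients; the vision encoder
# and the LLM remain frozen throughout training.
```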
To evaluate the efficacy of the approach, evaluations were carried out across several domains. For zero-shot and data-efficient classification, the publicly available CheXpert and ChestX-ray14 datasets were used, along with a private dataset collected from five medical facilities in India. Semantic search performance was assessed through experiments across four categories using the MIMIC-CXR test set. VQA was evaluated on both the Visual Question Answering in Radiology (VQA-RAD) benchmark and the MIMIC-CXR test set. Additionally, a board-certified thoracic radiologist assessed the LLM's output for radiology report quality assurance, again using the MIMIC-CXR test set. This multi-faceted evaluation aims to comprehensively validate the potential of the ELIXR approach across critical aspects of medical imaging analysis.
Experimental results
The ELIXR methodology demonstrated strong performance across various aspects of chest X-ray (CXR) analysis. Notably, it achieved a mean Area Under the Curve (AUC) of 0.850 in zero-shot CXR classification across 13 different findings. Furthermore, ELIXR excelled in data-efficient CXR classification, with mean AUCs of 0.893 and 0.898 for five specific findings (atelectasis, cardiomegaly, consolidation, pleural effusion, and pulmonary edema) using only 1% (~2,200 images) and 10% (~22,000 images) of the training data, respectively.
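For readers unfamiliar with the data-efficient setting, a common recipe (assumed here; not confirmed as the paper's exact protocol) is to freeze the learned image embeddings, fit a lightweight classifier on a small labeled fraction, and score with AUC per finding:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def data_efficient_probe(embeddings: np.ndarray, labels: np.ndarray,
                         fraction: float, seed: int = 0) -> float:
    """Fit a linear probe on a small labeled fraction of frozen image
    embeddings and report AUC on the held-out remainder.
    Assumes both classes appear in the sampled training fraction."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_train = max(2, int(fraction * len(labels)))
    train, test = idx[:n_train], idx[n_train:]

    probe = LogisticRegression(max_iter=1000)
    probe.fit(embeddings[train], labels[train])
    scores = probe.predict_proba(embeddings[test])[:, 1]
    return roc_auc_score(labels[test], scores)

# Hypothetical usage: auc = data_efficient_probe(emb, y, fraction=0.01)
```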
The methodology also demonstrated strong semantic search performance, achieving a normalized Discounted Cumulative Gain (NDCG) of 0.76 across 19 queries, including perfect retrieval on 12 of them. In terms of data efficiency, ELIXR outperformed existing methods such as supervised contrastive learning (SupCon), requiring two orders of magnitude less data to achieve comparable performance.
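NDCG, the retrieval metric quoted above, discounts each result's relevance by the logarithm of its rank and normalizes by the best possible ordering, so a score of 1.0 means perfect retrieval. A short, self-contained sketch of the standard formula, independent of the paper's code:

```python
import numpy as np

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: sum of rel_i / log2(i + 1), ranks 1-indexed."""
    return sum(rel / np.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances: list[float]) -> float:
    """DCG of the returned ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded relevance of the top-5 retrieved images for one query:
print(round(ndcg([3, 2, 3, 0, 1]), 2))  # 0.97; a perfect ordering scores 1.0
```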
Additionally, ELIXR exhibited promise in CXR vision-language tasks. It achieved an overall accuracy of 58.7% in visual question answering and 62.5% in report quality assurance tasks. These outcomes collectively underscore the robustness and versatility of ELIXR in the realm of CXR AI, marking a significant advancement in the fusion of language and image encoders for medical image analysis.
Conclusions
This study encompassed creating and assessing a streamlined vision-language multimodal model for medical imaging. The model was trained exclusively on medical images paired with free-text radiology reports drawn from routine clinical practice. Notably, the method proved efficient in both computational resources and data requirements during training. The findings highlighted promising results across an array of multimodal radiology tasks, and this work marks an initial stride toward a versatile X-ray AI system with broad applicability.