In a paper published in the journal Bioengineering, researchers addressed the challenge of applying deep learning in medicine by introducing explainable artificial intelligence (XAI). They adapted a high-accuracy computer vision model to medical text tasks and employed gradient-weighted class activation mapping (Grad-CAM) for intuitive visualization. Their system comprised four modules: pre-processing, word embedding, classification, and visualization.
After comparing various word embeddings and classifiers, a ResNet-based model trained on formalized medical text achieved the best performance. The approach, which combines ResNet and Grad-CAM, provided both high-accuracy classification and intuitive visualization of the words the model focused on during predictions.
Background
In recent years, artificial intelligence (AI) technology has advanced significantly, offering promising applications across various domains. In healthcare, AI has the potential to improve diagnostics and patient care. However, the opacity of AI algorithms poses ethical and practical challenges, leading to skepticism about their real-world performance.
To tackle this challenge, explainable AI (XAI) has emerged, underscoring the importance of AI models offering comprehensible explanations for their decisions. This need for transparency is particularly critical in medical applications, where gaining trust and fostering acceptance are paramount.
Related Work
Prior research has underscored the difficulties associated with the lack of transparency in AI models, especially in healthcare, where comprehending decision-making processes is essential. One notable XAI technique is Grad-CAM, which generates intuitive heat maps to illustrate an AI model's focus areas during predictions. Beyond diagnostics, AI has the potential to enhance healthcare by supporting clinical decisions, reducing errors, and providing real-time health risk assessments. Rule-based AI systems have been employed in healthcare but face scalability and manual-adjustment challenges. In medical text processing, AI focuses on tasks such as entity recognition and relationship extraction.
Proposed Method
This study applied transfer learning with computer vision models to text-processing tasks and used the Grad-CAM method for model explainability. Word2Vec served as the word-embedding tool, while ResNet was the primary classifier. The dataset comprised clinical text data categorized into five classes. Adapting Grad-CAM, originally developed for computer vision, made the text models interpretable by generating heat maps over the input, providing valuable insight into the models' decision-making process. The experiments were implemented in Python, using PyTorch for ResNet and Grad-CAM and a Word2Vec model from Gensim.
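The four-module organization suggests a pipeline along the following lines. This is a minimal, hypothetical skeleton: the function names, signatures, and wiring are illustrative, not taken from the authors' code.

```python
# Hypothetical skeleton of the four-module system described in the paper:
# pre-processing, word embedding, classification, and visualization.
import torch

def preprocess(record: str) -> list[str]:
    """Normalize and tokenize one clinical record."""
    return record.lower().split()

def embed(tokens: list[str]) -> torch.Tensor:
    """Map tokens to Word2Vec vectors packed into an image-like tensor."""
    raise NotImplementedError  # fleshed out in the embedding sketch below

def classify(text_image: torch.Tensor) -> int:
    """Run the ResNet-based classifier and return a class index."""
    raise NotImplementedError

def visualize(text_image: torch.Tensor) -> torch.Tensor:
    """Produce a Grad-CAM heat map over the input words."""
    raise NotImplementedError

def run(record: str):
    """Wire the four modules together for one record."""
    text_image = embed(preprocess(record))
    return classify(text_image), visualize(text_image)
```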
The dataset contained 14,438 clinical texts across five categories. Each record was transformed into a 25×25 format so that image-processing algorithms could be applied. Word2Vec generated word vectors with a window size of 100 and a dimensionality of 100. The Bidirectional Encoder Representations from Transformers (BERT) model was also employed for word embedding. ResNet18 and one-dimensional convolutional neural network (1D CNN) models served as classifiers, with the text treated as multi-channel images. Traditional methods such as Naïve Bayes, using tf-idf features, were included for comparison. The Grad-CAM module was integrated into the CNN classifier for attention visualization.
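One plausible reading of the 25×25, multi-channel design is that each record is padded or truncated to 625 tokens arranged on a 25×25 grid, with the 100 Word2Vec dimensions serving as channels. The sketch below follows that assumption; the toy corpus, the helper name `to_text_image`, and the exact packing order are illustrative, not from the paper.

```python
import numpy as np
import torch
from gensim.models import Word2Vec

SIDE, DIM = 25, 100  # 25x25 token grid and 100-dim embeddings, per the study

# Toy tokenized corpus; the actual dataset holds 14,438 clinical texts.
corpus = [
    ["patient", "presents", "with", "chest", "pain"],
    ["no", "acute", "distress", "observed"],
]
w2v = Word2Vec(sentences=corpus, vector_size=DIM, window=100, min_count=1)

def to_text_image(tokens: list[str]) -> torch.Tensor:
    """Pack one record into a (DIM, SIDE, SIDE) multi-channel 'image'."""
    grid = np.zeros((SIDE * SIDE, DIM), dtype=np.float32)  # zero-padded slots
    for i, tok in enumerate(tokens[: SIDE * SIDE]):        # truncate at 625
        if tok in w2v.wv:
            grid[i] = w2v.wv[tok]
    # (625, 100) -> (25, 25, 100) -> channels-first (100, 25, 25)
    return torch.from_numpy(grid.reshape(SIDE, SIDE, DIM)).permute(2, 0, 1)
```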
Experimental Analysis
The 2D ResNet18, fine-tuned from pre-trained parameters on Kaggle's medical text dataset, achieved the highest performance among the models. In contrast, the traditional Naïve Bayes text classifiers yielded considerably lower weighted F1 scores of 42.2% and 47.8%. Models based on AlexNet and VGG11 also exhibited inferior performance, with accuracy rates falling below those of Naïve Bayes.
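For reference, a tf-idf plus Naïve Bayes baseline of the kind the study compared against can be reproduced in a few lines with scikit-learn; the paper's exact feature settings are not given here, so defaults are used on toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the five clinical categories.
texts = ["patient presents with chest pain", "no acute distress observed"]
labels = [0, 1]

baseline = make_pipeline(TfidfVectorizer(), MultinomialNB())
baseline.fit(texts, labels)
print(baseline.predict(["chest pain on exertion"]))
```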
Several ResNet18 variants were compared to explore performance under different training regimes, as sketched below: (1) fine-tuning both the input and output layers of ResNet with parameters pre-trained on ImageNet and further training on the medical text dataset; (2) fine-tuning only the input and output layers while training on the medical text dataset; and (3) fine-tuning the input and output layers while keeping the remaining parameters pre-trained on ImageNet. Among these variants, the one exhibiting the best performance at epoch 25 was selected.
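A minimal sketch of adapting torchvision's ResNet18 to the embedded text images follows, assuming 100 input channels and five output classes as described; the layer-freezing shown is one reading of variant (3), not the authors' exact recipe.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights="IMAGENET1K_V1")  # start from ImageNet parameters

# Replace the input layer: 100 embedding channels instead of 3 RGB channels.
model.conv1 = nn.Conv2d(100, 64, kernel_size=7, stride=2, padding=3, bias=False)
# Replace the output layer: five clinical text categories.
model.fc = nn.Linear(model.fc.in_features, 5)

# Variant (3): train only the replaced input and output layers, keeping the
# ImageNet pre-trained backbone frozen.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(("conv1", "fc"))

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```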
Visualizing the training and validation accuracies of the ResNet18 model over 25 epochs showed that the model converged and exceeded 90% accuracy on both the training and validation sets. This study successfully applied ResNet to medical text-processing tasks and leveraged Grad-CAM for model interpretability. Using Grad-CAM to generate heat maps for interpretability in medical text is a novel contribution.
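A generic Grad-CAM sketch for this text-as-image setting is shown below. With 25×25 inputs, ResNet18's deepest block collapses to a 1×1 map, so the sketch hooks an earlier block (layer2, a 4×4 map) purely for illustration; the authors' choice of target layer is not specified here, and the model construction repeats the earlier sketch so the snippet stands alone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

# Rebuild the adapted classifier (100 input channels, 5 classes).
model = resnet18(weights=None)
model.conv1 = nn.Conv2d(100, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 5)

# Cache the target block's activations and gradients via hooks.
activations, gradients = {}, {}
model.layer2.register_forward_hook(
    lambda m, i, o: activations.update(value=o.detach()))
model.layer2.register_full_backward_hook(
    lambda m, gi, go: gradients.update(value=go[0].detach()))

def grad_cam(text_image: torch.Tensor) -> torch.Tensor:
    """Return a (25, 25) heat map aligned with the input token grid."""
    model.eval()
    logits = model(text_image.unsqueeze(0))       # shape (1, 5)
    cls = logits.argmax(dim=1).item()             # predicted class
    model.zero_grad()
    logits[0, cls].backward()                     # gradients for that class
    weights = gradients["value"].mean(dim=(2, 3), keepdim=True)  # GAP of grads
    cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=(25, 25), mode="bilinear", align_corners=False)
    cam = cam.squeeze()
    return cam / (cam.max() + 1e-8)               # normalize to [0, 1]
```

Cell (i, j) of the returned map scores the token placed at grid position (i, j) by the embedding step, so the heat map can be rendered directly over the original words.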
Although Grad-CAM provides valuable model explainability, its interpretability remains qualitative rather than quantitative. The suitability of the explanations therefore still depends on clinical assessment, which introduces subjectivity. Future research should explore standardized criteria for assessing model interpretability, possibly adopting methods such as SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) to gauge explainability. The findings also point to the need for XAI to identify the factors most influential in model outcomes and to find suitable linguistic expressions for interpretation, especially for complex "black-box" models such as artificial neural networks (ANNs). Future work should aim to pinpoint these influential factors through ablation experiments and introduce explicit linguistic expressions to enhance XAI's capabilities.
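As one concrete way to act on that suggestion, LIME's text explainer can wrap the whole pipeline behind a probability function. The sketch below assumes the hypothetical `to_text_image` helper and adapted `model` from the earlier sketches, placeholder class names, and that the lime package is installed.

```python
import numpy as np
import torch
from lime.lime_text import LimeTextExplainer

CLASSES = ["class_0", "class_1", "class_2", "class_3", "class_4"]  # placeholders

def predict_proba(texts: list[str]) -> np.ndarray:
    """Return class probabilities for raw texts, as LIME expects."""
    batch = torch.stack([to_text_image(t.lower().split()) for t in texts])
    with torch.no_grad():
        return torch.softmax(model(batch), dim=1).numpy()

explainer = LimeTextExplainer(class_names=CLASSES)
explanation = explainer.explain_instance(
    "patient presents with chest pain", predict_proba,
    num_features=5, top_labels=1,
)
top = explanation.available_labels()[0]
print(explanation.as_list(label=top))  # (word, weight) pairs, predicted class
```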
Conclusion
To sum up, ResNet was employed for medical text processing to enhance classification accuracy, and Grad-CAM visualization effectively highlighted the model's attention during predictions, contributing to model interpretability. Looking ahead, future work will focus on establishing quantitative criteria for assessing Grad-CAM's explainability and enabling performance comparisons with state-of-the-art models.