In an article posted to the arXiv* preprint server, researchers proposed a Transformer model with co-attention gated vision-language (CAT-ViL) embedding for surgical visual question localized answering (VQLA) tasks, aiming to help medical students and junior surgeons better understand surgical scenarios.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Expert knowledge in the medical sciences is typically acquired through extensive training and study. When junior surgeons, doctors, and medical students are learning surgery or facing a surgical scenario, specialists and senior surgeons answer their questions to improve their understanding of complex surgical scenes. However, the shortage of senior surgeons and the high clinical and academic workloads of working specialists make it difficult for them to find adequate time to guide students individually.
Although automated solutions, such as training systems, surgical simulation, and pre-recorded videos, have been proposed to assist students in learning surgical procedures, skills, and knowledge, many of the students' questions still need to be answered by experts.
Recently, studies have demonstrated the feasibility of developing reliable and safe visual question-answering (VQA) models in the medical domain. Specifically, Surgical-VQA models effectively answered questions about organs and tools in robotic surgery.
However, deep learning-based VQA models have not been able to help students better understand complex surgeries. For instance, when students ask about the tissue-tool interaction for a specific surgical tool, VQA models cannot indicate the location of the tool or the tissue involved in the surgical scene. Additionally, sentence-based Surgical-VQA models require annotated datasets in the medical domain, and manual annotation is extremely laborious and time-consuming.
Several studies have addressed VQA tasks in the computer vision domain. VQA models using Transformers, attention modules, and long short-term memory (LSTM) modules can significantly improve performance on VQA tasks.
Moreover, a unified Transformer model has also been proposed for joint vision-language (ViL) and object detection tasks. However, the object detection results strongly influence VQA performance in these models, which can hinder a comprehensive understanding of the surgical scene. Additionally, several VQA models fuse heterogeneous textual and visual features using attention, scalar product, averaging, or additive mechanisms.
However, such simple techniques cannot yield the best intermediate representation of the heterogeneous features, as each feature carries a different meaning in heterogeneous feature fusion. These VQA models also cannot identify the specific image regions relevant to the question and answer. A VQLA system can overcome these limitations and help junior surgeons and medical students understand and learn from recorded surgical videos.
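To illustrate what such simple fusion mechanisms look like in practice, the short sketch below applies additive, averaging, and element-wise product fusion to pooled text and visual feature vectors. The tensor shapes and names are purely illustrative assumptions, not taken from any specific model.

```python
# Hedged illustration of "simple" multimodal fusion mechanisms.
import torch

text_feat = torch.randn(1, 768)    # pooled question embedding (illustrative)
visual_feat = torch.randn(1, 768)  # pooled image embedding (illustrative)

additive_fusion = text_feat + visual_feat        # additive fusion
average_fusion = (text_feat + visual_feat) / 2   # averaging fusion
product_fusion = text_feat * visual_feat         # element-wise (scalar) product fusion

# These fixed operations weight both modalities identically for every feature,
# which is why richer, learned fusion (e.g., gated co-attention) is preferred.
```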
The study
In the present paper, researchers proposed an end-to-end Transformer model with a CAT-ViL embedding module for VQLA tasks in surgical scenarios. The embedding module was designed to fuse heterogeneous features from textual and visual sources, and no detection models were required for feature extraction.
The fused embedding was fed to a standard Data-efficient image Transformer (DeiT) module before parallel detection and classification heads for joint prediction. The proposed model was validated experimentally on public robotic surgical videos from the Medical Image Computing and Computer Assisted Intervention (MICCAI) Endoscopic Vision (EndoVis) Challenges of 2018 and 2017.
The publicly available EndoVis 2017 data served as an external validation set to demonstrate the model's generalization across surgical domains. The resulting CAT-ViL DeiT processed information from multiple modalities and performed the VQLA task in the surgical scenario. DeiT served as the backbone of a network comprising a vision feature extractor, a custom-trained tokenizer, the CAT-ViL embedding module, a standard DeiT module, and task-specific heads.
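For readers who prefer code, the sketch below outlines how such a detection-free VQLA pipeline could be wired together in PyTorch. It is a simplified illustration under assumed module choices (a ResNet stand-in for the vision feature extractor, a plain embedding layer for the tokenizer output, and generic Transformer layers in place of the CAT-ViL fusion and DeiT backbone), not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a detection-free VQLA pipeline.
# Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision.models as models


class VQLAPipelineSketch(nn.Module):
    def __init__(self, vocab_size=30522, embed_dim=768, num_answers=18):
        super().__init__()
        # Vision feature extractor: a ResNet trunk standing in for the
        # paper's visual feature extractor (no object detector involved).
        resnet = models.resnet18(weights=None)
        self.visual_extractor = nn.Sequential(*list(resnet.children())[:-2])
        self.visual_proj = nn.Linear(512, embed_dim)
        # The customized tokenizer's output is assumed to arrive as token ids;
        # a plain embedding layer stands in for the text embedding.
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        # Placeholder for the CAT-ViL embedding module that fuses the two
        # modalities (see the gated-fusion sketch later in the article).
        self.fusion = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        # Generic Transformer encoder standing in for the DeiT backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Parallel task-specific heads: answer classification and box regression.
        self.classifier = nn.Linear(embed_dim, num_answers)
        self.detector = nn.Linear(embed_dim, 4)  # (x, y, w, h) of the answer region

    def forward(self, image, question_ids):
        v = self.visual_extractor(image)                       # (B, 512, H', W')
        v = self.visual_proj(v.flatten(2).transpose(1, 2))     # (B, N_patches, D)
        t = self.text_embed(question_ids)                      # (B, N_tokens, D)
        fused = self.fusion(torch.cat([t, v], dim=1))          # joint embedding
        h = self.backbone(fused).mean(dim=1)                   # pooled representation
        return self.classifier(h), self.detector(h).sigmoid()  # answer logits, box
```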
Researchers experimentally validated the model by comparing the answering and localization performance of CAT-ViL DeiT against state-of-the-art (SOTA) methods, including BlockTucker, multimodal factorized high-order pooling (MFH), MUTAN, VQA-DeiT, the modular co-attention network (MCAN), Visual Bidirectional Encoder Representations from Transformers (VisualBERT), and VisualBERT residual multi-layer perceptron (VisualBERT ResMLP).
In VQA-DeiT, a pre-trained DeiT-Base block replaced the multilayer Transformer module of VisualBERT. Mean intersection over union (mIoU), F-score, and accuracy served as evaluation metrics during the experimental validation.
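As an illustration of these metrics, the hedged sketch below computes mIoU and answer accuracy for a batch of predictions. The (x1, y1, x2, y2) box convention and tensor shapes are assumptions; the F-score would typically be computed per answer class (for example, with sklearn.metrics.f1_score).

```python
# Illustrative metric computations; formulas are standard, the box
# convention (x1, y1, x2, y2) is an assumption for this sketch.
import torch


def box_iou(pred, target):
    """IoU between predicted and ground-truth boxes of shape (N, 4)."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    return inter / (area_p + area_t - inter + 1e-7)


def evaluate(pred_boxes, gt_boxes, pred_labels, gt_labels):
    miou = box_iou(pred_boxes, gt_boxes).mean().item()            # localization quality
    accuracy = (pred_labels == gt_labels).float().mean().item()   # answering quality
    return miou, accuracy
```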
All models were trained with the Adam optimizer using PyTorch on NVIDIA RTX 3090 GPUs. The batch size, learning rate, and number of epochs were set to 64, 1 × 10⁻⁵, and 80, respectively.
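A minimal training-loop sketch under these reported settings might look as follows; the model, dataset, and loss functions are placeholders, and the authors' exact loss formulation is not assumed here.

```python
# Sketch of the reported training setup (Adam, learning rate 1e-5,
# batch size 64, 80 epochs). Model, dataset, and losses are placeholders.
import torch
from torch.utils.data import DataLoader


def train(model, train_dataset, device="cuda"):
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)        # reported learning rate
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # reported batch size
    ce_loss = torch.nn.CrossEntropyLoss()   # answer-classification loss (assumed)
    box_loss = torch.nn.SmoothL1Loss()      # localization loss (assumed)

    for epoch in range(80):                 # reported number of epochs
        for images, questions, answers, boxes in loader:
            logits, pred_boxes = model(images.to(device), questions.to(device))
            loss = ce_loss(logits, answers.to(device)) + box_loss(pred_boxes, boxes.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```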
Significance of the study
The proposed Transformer model with the CAT-ViL embedding module effectively provided localized answers for a given surgical scene and the related question. Specifically, the CAT-ViL embedding module facilitated optimal fusion and interaction of multimodal features.
In the CAT-ViL embedding, the co-attention module allowed the text embeddings to interact instructively with the visual embeddings, while the gated module identified the best intermediate representation for the heterogeneous embeddings.
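A minimal sketch of such a co-attention plus gated fusion step is shown below; the attention direction, gate formulation, and dimensions are assumptions for illustration rather than the paper's exact design.

```python
# Illustrative sketch (not the paper's implementation) of co-attention
# followed by a gated fusion of text and visual embeddings.
import torch
import torch.nn as nn


class CoAttentionGatedFusionSketch(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        # Co-attention: text embeddings act as queries that attend over
        # (and are thereby guided by) the visual embeddings.
        self.co_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gate that decides, per dimension, how much of each modality
        # to keep in the intermediate representation.
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_emb, visual_emb):
        # text_emb: (B, N_t, D), visual_emb: (B, N_v, D)
        attended_visual, _ = self.co_attn(
            query=text_emb, key=visual_emb, value=visual_emb
        )                                                    # (B, N_t, D)
        g = torch.sigmoid(self.gate(torch.cat([text_emb, attended_visual], dim=-1)))
        fused = g * text_emb + (1.0 - g) * attended_visual   # gated mixture
        return fused
```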
Several ablation, robustness, and comparative experiments demonstrated the excellent performance and stability of the proposed model over every SOTA method in both localization and question-answering tasks, indicating its potential for real-world, real-time applications.
Thus, the Transformer-based VQLA model demonstrated the potential of AI-based VQLA systems for surgical scene understanding and surgical training. The comparison between detection-free and detection-based feature extractors showed that eliminating computationally costly and error-prone detection proposals enables end-to-end, real-time applications and superior representation learning.
To summarize, the study's findings showed that the proposed model for surgical VQLA tasks can effectively assist junior surgeons and medical students in understanding the surgical scene. However, more research is required to quantify uncertainty and improve the reliability of such safety-critical systems in the medical field.