MindGPT: Bridging the Gap Between Brain Signals and Visual Descriptions

In an article recently submitted to the arXiv* preprint server, researchers introduced MindGPT (Mind Generative Pre-trained Transformer), a non-invasive neural decoder that recovers descriptions of seen images from brain signals by translating functional Magnetic Resonance Imaging (fMRI) responses to visual stimuli into natural language.

Study: MindGPT: Bridging the Gap Between Brain Signals and Visual Descriptions. Image credit: metamorworks/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as conclusive, guide clinical practice/health-related behavior, or treated as established information.

Unlike prior methods, MindGPT produces faithful and interpretable results, shedding light on the connection between visual properties and language semantics. Results show that the generated word sequences accurately represent the visual information in the stimuli, with the higher visual cortex proving more semantically informative than the lower visual cortex for language decoding tasks.

Background

Humans have a remarkable capacity to describe visual experiences with language, suggesting that semantics are deeply intertwined with sensory input, especially vision, and neuroscience studies provide evidence of a shared, amodal semantic representation. For instance, the word "cat" can evoke the mental image of a cat in one's mind. However, the mechanisms underlying semantic relations across modalities, and the seamless transitions between visual and linguistic modes, still require quantification and computational models.

Recent neural decoding efforts have demonstrated potential in reconstructing visual content from fMRI data, yet challenges persist in image quality and semantic coherence. A "mind reading" technology that verbally interprets visual stimuli promises to uncover cross-modal semantic integration mechanisms and enable applications in brain-computer interfaces (BCIs).

Related Work

Previous research in visual neural decoding has evolved with deep learning and neuroscience advances. Researchers have focused on three fundamental paradigms: stimuli classification, recognition, and reconstruction, with a growing emphasis on the challenge of visual reconstruction. This task involves extracting fine-grained image details from fMRI brain activity, transitioning from pixel-level precision to semantically accurate representations, influenced mainly by the emergence of diffusion models.

While earlier techniques excelled at capturing outlines and postures of stimuli, they struggled with intricate textures and colors due to data limitations. In contrast, high-level semantic decoding methods, which incorporate visual semantic information into models, produced realistic but detail-limited images.

Proposed Method

The proposed framework, MindGPT, aims to generate descriptive sentences from the brain's activity patterns in response to visual stimuli. It involves several critical components: dataset selection and preprocessing, Contrastive Language-Image Pre-training (CLIP)-guided neural embedding, and vision-language joint modeling.

In terms of dataset and preprocessing, MindGPT utilizes the Deep Image Reconstruction (DIR) dataset, a well-established benchmark for fMRI-based decoding, in which fMRI signals were recorded while subjects viewed natural images from ImageNet. The fMRI data encompasses the brain's visual areas, including V1-V4, the Lateral Occipital Complex (LOC), Fusiform Face Area (FFA), and Parahippocampal Place Area (PPA), categorized into lower and higher visual cortex regions. The objective is to align the neural representation with semantic information.
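As a rough illustration of this lower/higher split, the sketch below groups the listed regions of interest (ROIs) and concatenates their voxel responses. The grouping names, voxel counts, and random placeholder signals are hypothetical, not taken from the paper's preprocessing code.

```python
# Hypothetical ROI grouping for the lower vs. higher visual cortex split.
import numpy as np

ROI_GROUPS = {
    "lower_visual_cortex": ["V1", "V2", "V3", "V4"],
    "higher_visual_cortex": ["LOC", "FFA", "PPA"],
}

def concat_roi_voxels(roi_signals, group):
    """Concatenate per-ROI voxel responses (1-D arrays) for one cortex group."""
    return np.concatenate([roi_signals[name] for name in ROI_GROUPS[group]])

# Random placeholders standing in for preprocessed fMRI responses (200 voxels per ROI).
roi_signals = {name: np.random.randn(200)
               for rois in ROI_GROUPS.values() for name in rois}
lower = concat_roi_voxels(roi_signals, "lower_visual_cortex")
higher = concat_roi_voxels(roi_signals, "higher_visual_cortex")
```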

To achieve this, MindGPT employs CLIP-guided neural embedding to steer the model toward the semantic content of the stimulus images. It begins by processing fMRI signals, converting them into a sequence of voxel vectors, and then projects and transforms these vectors using a trainable linear projection and a Transformer encoder. CLIP's hidden class embedding serves as a neural proxy, establishing a shared semantic space between images and fMRI signals. Data augmentation techniques similar to mixup are also applied to enhance the model's ability to extract high-level semantic features from augmented images.
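The following is a minimal PyTorch sketch of such an fMRI encoder, assuming voxels have been pre-grouped into fixed-size "tokens". All layer sizes, module names, and the CLIP embedding dimension are illustrative assumptions, not the paper's exact implementation, and the mixup-style augmentation mentioned above is omitted.

```python
import torch
import torch.nn as nn

class FMRIEncoder(nn.Module):
    """Sketch: linear projection + Transformer encoder over voxel tokens,
    with a class token mapped into CLIP's embedding space (the 'neural proxy')."""
    def __init__(self, voxels_per_token=256, n_tokens=32, d_model=512, clip_dim=512):
        super().__init__()
        # Trainable linear projection from raw voxel chunks to token embeddings.
        self.proj = nn.Linear(voxels_per_token, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))            # learnable class token
        self.pos = nn.Parameter(torch.zeros(1, n_tokens + 1, d_model)) # positional embeddings
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Head aligning the class token with CLIP's hidden class embedding.
        self.to_clip = nn.Linear(d_model, clip_dim)

    def forward(self, voxel_tokens):                     # (B, n_tokens, voxels_per_token)
        x = self.proj(voxel_tokens)
        x = torch.cat([self.cls.expand(x.size(0), -1, -1), x], dim=1) + self.pos
        h = self.encoder(x)                              # (B, n_tokens + 1, d_model)
        neural_cls = self.to_clip(h[:, 0])               # shared semantic space with CLIP
        return h, neural_cls
```

The encoded token sequence `h` can then be handed to the language decoder via cross-attention, while `neural_cls` is aligned with the CLIP image embedding during training.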

MindGPT further integrates vision and language through a joint modeling approach. It employs an autoregressive language model, GPT-2, to generate well-formed sentences based on the brain's activity patterns. Cross-attention layers connect the fMRI encoder and the GPT decoder, optimizing the model for an end-to-end multi-task learning setup. The loss function considers both the cross-entropy loss for language modeling and a mean-squared loss for alignment between the fMRI and image embeddings, with a trade-off hyperparameter to balance the two objectives. This approach facilitates a direct mapping between brain activity and text, guided by visual cues, offering expandability to other neural decoding tasks and reducing information loss.
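A hedged sketch of the multi-task objective described above appears below: language-modeling cross-entropy plus a mean-squared alignment term between the fMRI embedding and the CLIP image embedding. The variable names and the value of the trade-off hyperparameter `lambda_align` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mindgpt_loss(lm_logits, target_ids, neural_cls, clip_image_emb,
                 lambda_align=1.0, pad_token_id=0):
    """Combined objective: next-token cross-entropy + fMRI/CLIP alignment MSE."""
    # Shift so each position predicts the next token (standard GPT-2 style).
    shift_logits = lm_logits[:, :-1, :].contiguous()
    shift_labels = target_ids[:, 1:].contiguous()
    lm_loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=pad_token_id,
    )
    # Alignment between the fMRI "neural proxy" and CLIP's image embedding.
    align_loss = F.mse_loss(neural_cls, clip_image_emb)
    return lm_loss + lambda_align * align_loss
```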

Experimental Results

This study employed the MindGPT framework, consisting of two pre-trained sub-models, CLIP Vision Transformer (ViT-B/32) and GPT-2 Base, with only the fMRI encoder and the cross-attention layers left trainable. The researchers used a standard Vision Transformer (ViT) with specific configurations as the fMRI encoder and trained the model using the Adam solver.

The researchers trained the model on the DIR dataset and a subset of ImageNet, implementing the framework in PyTorch on NVIDIA GeForce RTX 3090 Graphics Processing Units (GPUs).
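An illustrative setup for this frozen/trainable split is shown below. The module names (`clip_vit`, `gpt2`, `fmri_encoder`, `cross_attn_layers`), the learning rate, and the assumption that the cross-attention layers live in a separate module are placeholders, not details confirmed by the paper.

```python
import torch

def build_optimizer(clip_vit, gpt2, fmri_encoder, cross_attn_layers, lr=1e-4):
    """Freeze the pre-trained sub-models; train only the fMRI encoder
    and the cross-attention layers, as described in the summary above."""
    for p in clip_vit.parameters():
        p.requires_grad = False
    for p in gpt2.parameters():
        p.requires_grad = False
    trainable = list(fmri_encoder.parameters()) + list(cross_attn_layers.parameters())
    return torch.optim.Adam(trainable, lr=lr)
```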

Metrics such as Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE) assessed the model's language decoding performance.
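For a concrete sense of one of these metrics, the snippet below computes BLEU-4 with NLTK on a made-up reference/candidate caption pair; the captions are invented for illustration, and the paper's full evaluation also covers ROUGE, METEOR, CIDEr, and SPICE via standard captioning toolkits.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a brown dog runs across the grass".split()]   # ground-truth caption(s)
candidate = "a dog is running on the grass".split()         # decoded sentence from fMRI

# Equal weights over 1- to 4-grams give the standard BLEU-4 score.
bleu4 = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```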

The experimental results revealed that the larger model variant, MindGPT-L, outperformed the smaller ones, MindGPT-B and MindGPT-S, across multiple language similarity metrics. Notably, the BLEU-4 score, which reflects precision in matching four consecutive words, showed substantial improvements with the larger model. Additionally, the choice of cross-attention layers affected decoding performance, with smaller cross-attention modules performing better. However, further research is needed to understand the performance saturation limits.

Conclusion

In summary, this study introduces an innovative, non-invasive decoding approach that combines large vision-language models to bridge the gap between visual and linguistic representations. Initial experiments demonstrate the framework's effectiveness, highlighting the connection between amodal semantic concepts and real-world visible objects. While these results are promising, unanswered questions and ongoing challenges remain, such as quantifying semantic information in visual attention and exploring semantic relations between the visual cortex and the anterior temporal lobe. Overall, this research has implications for understanding how the brain processes sensory information and offers potential therapeutic applications, especially for individuals with semantic dementia.



Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

