In an article recently submitted to the arXiv* preprint server, researchers introduced MindGPT (Mind Generative Pre-trained Transformer), a non-invasive neural decoder that describes what a person is seeing by translating visual stimuli into natural language from functional Magnetic Resonance Imaging (fMRI) signals.
Unlike prior methods, MindGPT maintains semantic fidelity while producing interpretable results, shedding light on the connection between visual properties and language semantics. The results show that the generated word sequences accurately represent the visual information, and that the higher visual cortex is more semantically informative than the lower visual cortex for language decoding.
Background
Humans have a remarkable capacity to describe visual experiences with language, suggesting that semantics are deeply intertwined with sensory input, especially vision. Neuroscience studies provide evidence for a shared, amodal semantic representation: the word "cat," for instance, can evoke the mental image of a cat. However, the mechanisms underlying these cross-modal semantic relations, and the seamless transitions between visual and linguistic modes, still lack quantification and computational models.
Recent neural decoding efforts have demonstrated potential in reconstructing visual content from fMRI data, yet challenges persist in image quality and semantic coherence. A "mind reading" technology that verbally interprets visual stimuli therefore promises both to uncover the mechanisms of cross-modal semantic integration and to enable applications in brain-computer interfaces (BCIs).
Related Work
Previous research in visual neural decoding has evolved alongside advances in deep learning and neuroscience. It has focused on three fundamental paradigms: stimulus classification, recognition, and reconstruction, with growing emphasis on the challenge of visual reconstruction. This task involves extracting fine-grained image details from fMRI brain activity, and the field has shifted from pursuing pixel-level precision toward semantically accurate representations, driven largely by the emergence of diffusion models.
While earlier techniques excelled at capturing the outlines and postures of stimuli, they struggled with intricate textures and colors due to data limitations. In contrast, high-level semantic decoding methods, which incorporate visual semantic information into the models, produce realistic but detail-limited images.
Proposed Method
The proposed framework, MindGPT, aims to generate descriptive sentences from the brain's activity patterns in response to visual stimuli. It comprises several critical components: dataset selection and preprocessing, Contrastive Language-Image Pre-training (CLIP)-guided neural embedding, and vision-language joint modeling.
For its dataset and preprocessing, MindGPT uses the Deep Image Reconstruction (DIR) dataset, a well-established benchmark for fMRI-based decoding in which subjects viewed natural images from ImageNet while their fMRI signals were recorded. The data cover the brain's visual areas, including V1-V4, the Lateral Occipital Complex (LOC), the Fusiform Face Area (FFA), and the Parahippocampal Place Area (PPA), grouped into lower and higher visual cortex regions, as sketched below. The objective is to align these neural representations with semantic information.
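As an illustration of the lower/higher visual cortex grouping described above, here is a minimal Python sketch. The ROI names follow the regions listed in this article; the per-ROI voxel counts and the helper function are hypothetical and not part of any official dataset interface.

```python
import numpy as np

# ROI grouping described in the article (illustrative constants, not a dataset API).
LOWER_VISUAL_CORTEX = ["V1", "V2", "V3", "V4"]
HIGHER_VISUAL_CORTEX = ["LOC", "FFA", "PPA"]

def select_voxels(fmri_by_roi, rois):
    """Concatenate voxel responses from the chosen ROIs into one feature vector."""
    return np.concatenate([fmri_by_roi[roi] for roi in rois], axis=-1)

# Example with made-up voxel counts per ROI.
fmri_by_roi = {roi: np.random.randn(100) for roi in LOWER_VISUAL_CORTEX + HIGHER_VISUAL_CORTEX}
higher_vc = select_voxels(fmri_by_roi, HIGHER_VISUAL_CORTEX)  # shape: (300,)
```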
To achieve this, MindGPT employs CLIP-guided neural embedding to steer the learned representation toward the semantic content of the stimulus images. It first processes the fMRI signals into a sequence of voxel vectors, which are then projected and transformed by a trainable linear projection and a Transformer encoder. CLIP's hidden class embedding serves as a neural proxy, establishing a shared semantic space between images and fMRI signals. Data augmentation similar to mixup is also applied to improve the model's ability to extract high-level semantic features from augmented images. A minimal sketch of this pathway follows.
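The PyTorch sketch below illustrates the fMRI-to-embedding pathway just described. The module sizes, depth, and pooling choice are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FMRIEncoder(nn.Module):
    """Sketch: voxel patches -> trainable linear projection -> Transformer encoder."""
    def __init__(self, patch_dim=256, embed_dim=512, depth=4, heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, embed_dim)  # trainable linear projection
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, voxel_patches):                # (batch, num_patches, patch_dim)
        tokens = self.encoder(self.proj(voxel_patches))
        return tokens, tokens.mean(dim=1)            # token sequence + pooled summary

# The pooled summary can be regressed onto CLIP's class embedding, which acts as
# the neural proxy for the shared image-fMRI semantic space (see the loss below).
```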
MindGPT further integrates vision and language through a joint modeling approach. An autoregressive language model, GPT-2, generates well-formed sentences conditioned on the brain's activity patterns, and cross-attention layers connect the fMRI encoder to the GPT decoder, allowing end-to-end optimization in a multi-task setup. The loss combines a cross-entropy term for language modeling with a mean-squared term for alignment between the fMRI and image embeddings, balanced by a trade-off hyperparameter (see the sketch below). This direct mapping between brain activity and text, guided by visual cues, extends to other neural decoding tasks and reduces information loss.
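The following is a hedged sketch of such a multi-task objective, under the assumptions stated in this article: a language-modeling cross-entropy term plus a mean-squared alignment term weighted by a trade-off hyperparameter. The variable names (`lm_logits`, `clip_embed`, `lambda_align`) are illustrative.

```python
import torch
import torch.nn.functional as F

def mindgpt_style_loss(lm_logits, target_ids, fmri_embed, clip_embed, lambda_align=1.0):
    # Cross-entropy over caption tokens: logits (B, T, vocab) vs. target ids (B, T).
    ce = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)), target_ids.reshape(-1))
    # Mean-squared alignment between pooled fMRI and CLIP image embeddings (B, D).
    mse = F.mse_loss(fmri_embed, clip_embed)
    # Trade-off hyperparameter balances language modeling against embedding alignment.
    return ce + lambda_align * mse
```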
Experimental Results
The MindGPT framework consists of two pre-trained sub-models, the CLIP Vision Transformer (ViT-B/32) and GPT-2 Base, with only the fMRI encoder and cross-attention layers left trainable. The researchers used a standard Vision Transformer (ViT) configuration as the fMRI encoder and trained it with the Adam solver, as sketched below.
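A minimal sketch of this training setup is shown below. The `nn.Linear` and `nn.MultiheadAttention` modules are stand-ins for the actual CLIP ViT-B/32, GPT-2 Base, fMRI encoder, and cross-attention components, and the learning rate is an assumption.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the real modules (illustrative shapes only).
clip_vision  = nn.Linear(768, 512)     # stand-in for the frozen CLIP ViT-B/32 tower
gpt2         = nn.Linear(512, 50257)   # stand-in for the frozen GPT-2 Base decoder
fmri_encoder = nn.Linear(4096, 512)    # stand-in for the trainable fMRI encoder
cross_attn   = nn.MultiheadAttention(512, 8, batch_first=True)  # trainable cross-attention

# Freeze the pre-trained sub-models; only the fMRI encoder and cross-attention train.
for p in list(clip_vision.parameters()) + list(gpt2.parameters()):
    p.requires_grad = False

trainable = list(fmri_encoder.parameters()) + list(cross_attn.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)  # Adam, as in the article; lr assumed
```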
The researchers trained the model on the DIR dataset and a subset of ImageNet, implementing it in PyTorch on NVIDIA GeForce RTX 3090 Graphics Processing Units (GPUs).
The model's language decoding performance was assessed with metrics such as Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Consensus-based Image Description Evaluation (CIDEr), and Semantic Propositional Image Caption Evaluation (SPICE).
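As a small worked example of one of these metrics, the snippet below computes a BLEU-4 score with NLTK. The reference and candidate captions are made up for illustration; the full evaluation also covers ROUGE, METEOR, CIDEr, and SPICE, which require their own tooling.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up reference and candidate captions (token lists).
reference = [["a", "brown", "dog", "runs", "on", "the", "grass"]]
candidate = ["a", "dog", "runs", "on", "grass"]

# BLEU-4: equal weights over 1- to 4-gram precision, with smoothing for short texts.
bleu4 = sentence_bleu(reference, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-4: {bleu4:.3f}")
```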
The experimental results revealed that the largest model variant, MindGPT-L, outperformed the smaller MindGPT-B and MindGPT-S across multiple language similarity metrics. Notably, the BLEU-4 score, which reflects precision in matching sequences of four consecutive words, improved substantially with the larger model. The choice of cross-attention layers also affected decoding performance, with smaller cross-attention modules performing better; however, further research is needed to understand where performance saturates.
Conclusion
In summary, this study introduces an innovative, non-invasive decoding approach that combines large vision-language models to bridge the gap between visual and linguistic representations. Initial experiments demonstrate the framework's effectiveness and highlight the connection between amodal semantic concepts and real-world visible objects. While these results are promising, open questions remain, such as quantifying the semantic information carried by visual attention and exploring semantic relations between the visual cortex and the anterior temporal lobe. Overall, the research has implications for understanding how the brain processes sensory information and offers potential therapeutic applications, particularly for individuals with semantic dementia.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.