Enhanced Brain Response Prediction to Real-World Scenes Through Multimodal Pre-training

In an article recently published in the journal Nature Machine Intelligence, researchers demonstrated that models trained with natural language feedback and larger, more diverse training sets can better predict brain responses to complex, real-world scenes.

Study: Enhanced Brain Response Prediction to Real-World Scenes Through Multimodal Pre-training. Image credit: Generated using DALL·E 3

Background

Advancements in deep learning have enabled deep neural networks that share learned representations and task goals with natural systems to predict brain responses. However, most models utilized for brain response prediction are based on ImageNet pre-training and learn a low-dimensional task objective, such as categorization.

Natural vision evolved over millions of years by drawing on linguistic, conceptual, and perceptual sources of information, and it supports many different tasks. Incorporating comparable multimodal sources during network training, for example by using complex datasets rich in human-relevant information, remains a significant challenge for understanding such biological systems.

Recently, state-of-the-art models have demonstrated substantially improved performance in both language and vision tasks. These advances can be attributed to learning more complex human semantics from several modalities and using more diverse and larger training sets compared to the sets utilized in previous models.

Specifically, language can effectively highlight behaviorally relevant aspects of the world and draw attention to human semantics in the training data. Similarly, larger training sets provide more numerous and higher-quality examples of this high-dimensional supervisory signal.

The proposed approach

In this study, researchers investigated whether higher-performing models using natural language feedback and larger, more diverse training sets could better predict brain responses to complex, real-world scenes. Specifically, they evaluated and quantified the contributions of large-scale, diverse multimodal pre-training, provided through contrastive language-image pre-training (CLIP), to generating semantically grounded representations of natural scenes.

Researchers used CLIP pre-trained models to represent the class of models that leverage cross-modal supervision, using scene images to supervise language and image captions to supervise vision, when studying visual representations. CLIP learns image embeddings that best match the text embeddings of the corresponding image captions across large, diverse datasets.
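
To make this idea concrete, the following minimal sketch (not code from the study) shows the symmetric contrastive objective that CLIP-style models optimize, written in PyTorch; the function name and temperature value are illustrative assumptions.

```python
# Minimal sketch of a CLIP-style symmetric contrastive objective:
# matched image/caption embeddings are pulled together, mismatched
# pairs are pushed apart. Not the authors' training code.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # L2-normalize embeddings so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Similarity matrix: logits[i, j] compares image i with caption j.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: each image should match its own caption,
    # and each caption its own image.
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```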

Model learning with CLIP is more similar to human visual learning, in which top-down knowledge primarily influences the earliest layers of the visual pathway. Additionally, CLIP’s joint image and natural language pre-training and its diverse, large training sets can effectively capture the fine-grained visual experience of humans. Thus, CLIP is a suitable model for investigating brain prediction along several dimensions.

Moreover, the impact of various model architectures can be evaluated due to the versatility of the CLIP scheme, while the impact of dataset diversity and size can be explored by performing controlled comparisons using related datasets and models.

Researchers extracted network representations from CLIP-trained neural network models, including ViTCLIP and ResNetCLIP, as well as from several single-modality models, including an ImageNet pre-trained ResNet50 and the language model BERT. Subsequently, voxelwise encoding models based on CLIP image features were developed to predict the brain responses evoked by viewing images from the Natural Scenes Dataset (NSD).
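
As a rough illustration of this feature-extraction step, the sketch below uses the open-source `clip` package to obtain image embeddings from a CLIP ResNet50 backbone; the folder of stimulus images and the helper function are hypothetical and do not reproduce the authors' pipeline.

```python
# Illustrative feature extraction with the open-source `clip` package
# (https://github.com/openai/CLIP); image paths are placeholders.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # CLIP ResNet50 backbone

@torch.no_grad()
def extract_image_features(image_paths):
    feats = []
    for path in image_paths:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
        feats.append(model.encode_image(image).float().cpu())
    return torch.cat(feats)  # shape: (n_images, feature_dim)
```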

Several open-source models were selected for direct comparison with CLIP along four factors: data diversity, dataset size, feedback, and architecture. These comparison models included SimCLR, a self-supervised vision model; SLIP, a self-supervised model that also incorporates language feedback; and several open versions of CLIP.

These models were trained on datasets containing 15 million, 400 million, or two billion image-caption pairs. Encoding models were then constructed using these networks to explain NSD responses, allowing the contributions of data diversity, dataset size, pre-training, and architecture to be assessed and quantified accurately.
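
The sketch below illustrates the general voxelwise encoding approach with cross-validated ridge regression on synthetic stand-in data; the study's actual regression and cross-validation details may differ.

```python
# Minimal sketch of a voxelwise encoding model: ridge regression maps
# network features for each image to measured voxel responses.
# Synthetic data stand in for CLIP features and NSD responses.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 512))          # stand-in for CLIP features
voxel_responses = features @ rng.standard_normal((512, 100)) \
    + 0.5 * rng.standard_normal((1000, 100))         # synthetic voxel data

X_train, X_test, y_train, y_test = train_test_split(
    features, voxel_responses, test_size=0.2, random_state=0)

encoder = RidgeCV(alphas=np.logspace(-2, 4, 13))     # cross-validated penalty
encoder.fit(X_train, y_train)                        # one weight map per voxel
y_pred = encoder.predict(X_test)                     # held-out predictions
```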

Significance of the study

Results demonstrated that brain prediction performance was consistently higher for CLIP than for the other models. CLIP-based encoding models predicted high-level visual representations in the human brain better than single-modality models pre-trained on smaller, less diverse datasets.

The CLIP-trained ResNet50 explained up to 79% of the variance in individual voxel responses in the held-out test data, significantly more than models trained only on text (BERT) or on image-label pairs (ImageNet-trained ResNet). Comparative evaluation of different model backbones confirmed that network architecture played no significant role in the improved performance of CLIP-based models.
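
Variance explained is typically computed per voxel on held-out data as an R²-style statistic; the short sketch below shows one common formulation, assuming prediction and response arrays shaped as images × voxels, and is not taken from the study itself.

```python
# Sketch of the evaluation idea: the fraction of variance in held-out
# voxel responses explained by the encoding model's predictions.
import numpy as np

def variance_explained(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    # R^2 per voxel: 1 - residual variance / total variance.
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

# Example use with the ridge sketch above: r2 = variance_explained(y_test, y_pred)
```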

However, comparisons across models controlled for data diversity and dataset size demonstrated that natural language feedback and data diversity in larger datasets were important factors in explaining neural responses in high-level visual brain regions.

The improvements were attributed primarily to data diversity and to the joint image-caption training that such data provides, beyond a certain training dataset size. An improvement from language feedback was observed even when dataset factors were controlled.

Moreover, visual brain responses could also be predicted successfully using only image captions, indicating that CLIP models can bridge vision and natural language. Principal component analysis (PCA) and visualizations of the model embeddings showed that the models capture both fine-grained and global semantic dimensions represented within the human visual cortex.
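
The sketch below illustrates the kind of PCA-based inspection described here, applied to stand-in embeddings rather than the authors' data or analysis code.

```python
# Sketch of the visualization step: PCA on (synthetic, stand-in) image
# embeddings to inspect the dominant dimensions a model represents.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 1024))   # stand-in for CLIP embeddings

pca = PCA(n_components=10)
components = pca.fit_transform(embeddings)       # (n_images, 10) projections
print(pca.explained_variance_ratio_)             # variance captured per component
```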


Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.

