Enhanced Brain Response Prediction to Real-World Scenes Through Multimodal Pre-training

In an article recently published in the journal Nature Machine Intelligence, researchers demonstrated that models trained with natural language feedback on larger, more diverse datasets can better predict brain responses to complex, real-world scenes.

Study: Enhanced Brain Response Prediction to Real-World Scenes Through Multimodal Pre-training. Image credit: Generated using DALL·E 3

Background

Advances in deep learning have produced deep neural networks that, by sharing learned representations and task goals with biological visual systems, can predict brain responses. However, most models used for brain response prediction are pre-trained on ImageNet and optimized for a low-dimensional task objective, such as object categorization.

Natural vision, by contrast, evolved over millions of years while integrating linguistic, conceptual, and perceptual sources of information, and it supports many tasks at once. Incorporating such multimodal sources into network training, for example by using rich datasets that contain human-relevant information, is a significant challenge for understanding these biological systems.

Recently, state-of-the-art models have demonstrated substantially improved performance on both language and vision tasks. These advances can be attributed to learning more complex human semantics from several modalities and to using larger, more diverse training sets than those employed by previous models.

Specifically, language can effectively highlight behaviorally relevant aspects of the world and draw attention to human semantics in the training data. Similarly, larger training sets provide more, and more varied, examples of this high-dimensional supervisory signal.

The proposed approach

In this study, the researchers investigated whether higher-performing models that use natural language feedback and larger, more diverse training sets could better predict brain responses to complex, real-world scenes. Specifically, they evaluated and quantified the contribution of large-scale, diverse multimodal pre-training, provided through contrastive language-image pre-training (CLIP), to generating semantically grounded representations of natural scenes.

The researchers used CLIP-pre-trained models to represent the class of models that leverage cross-modal supervision, in which scene images supervise the language representation and image captions supervise the visual representation. CLIP learns image embeddings that best match the text embeddings of the corresponding captions across large, diverse datasets of image-caption pairs.
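To illustrate the idea behind this training scheme, the sketch below implements a symmetric contrastive (CLIP-style) loss over a batch of paired image and caption embeddings. It is a minimal illustration rather than the study's implementation; the embedding dimension, batch size, and temperature value are arbitrary placeholders.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/caption embeddings.

    image_emb, text_emb: (batch, dim) tensors from an image and a text encoder.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    # L2-normalise so dot products are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-caption and caption-to-image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt))
```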

Learning in CLIP is more similar to human visual learning, in which top-down knowledge primarily influences the earliest layers of the visual pathway. Additionally, CLIP's joint image and natural language pre-training on large, diverse training sets can effectively capture the fine-grained visual experience of humans. CLIP is therefore a suitable model for investigating brain prediction along several dimensions.

Moreover, the versatility of the CLIP scheme makes it possible to evaluate the impact of different model architectures, while controlled comparisons across related datasets and models allow the effects of dataset diversity and size to be explored.

The researchers extracted network representations from CLIP-trained neural network models, including a vision transformer (ViTCLIP) and a ResNet (ResNetCLIP), and from several single-modality models, including an ImageNet-pre-trained ResNet50 and BERT. Voxelwise encoding models based on the CLIP image features were then developed to predict the brain responses evoked by viewing images from the Natural Scenes Dataset (NSD).
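As a rough sketch of how such image features can be extracted in practice, the example below uses the Hugging Face transformers implementation of CLIP. The checkpoint name and the blank placeholder image are illustrative stand-ins, not the specific backbones or NSD stimuli used in the study.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint; the study's CLIP ViT/ResNet backbones may differ.
checkpoint = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(checkpoint).eval()
processor = CLIPProcessor.from_pretrained(checkpoint)

# Blank placeholder image standing in for an NSD stimulus
image = Image.new("RGB", (224, 224))

with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt")
    features = model.get_image_features(**inputs)  # shape: (1, embedding_dim)

print(features.shape)
```

One such feature vector per stimulus image provides the regressors of a voxelwise encoding model.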

Several open-source models were selected for direct comparison with CLIP along four factors: data diversity, dataset size, feedback, and architecture. These models were SimCLR, a self-supervised model; SLIP, a self-supervised model that adds language feedback; and several open versions of CLIP.

These models were trained on datasets containing 15 million, 400 million, or two billion image-caption pairs. Encoding models were then constructed from these networks to explain the NSD responses, allowing the contributions of data diversity, dataset size, pre-training, and architecture to be assessed and quantified.
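A minimal sketch of such a controlled comparison is shown below, assuming the stimulus features from each candidate network and the corresponding NSD voxel responses have already been extracted. Here random arrays stand in for both, and the model names, array shapes, and ridge penalties are illustrative only.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Random stand-ins for per-image features from each candidate network
# (named after rough training-set scales) and for NSD voxel responses.
n_images, n_voxels = 1000, 500
feature_sets = {
    "clip_15m_pairs": rng.standard_normal((n_images, 512)),
    "clip_400m_pairs": rng.standard_normal((n_images, 512)),
    "clip_2b_pairs": rng.standard_normal((n_images, 512)),
}
voxel_responses = rng.standard_normal((n_images, n_voxels))

for name, X in feature_sets.items():
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, voxel_responses, test_size=0.2, random_state=0
    )
    encoder = RidgeCV(alphas=np.logspace(-2, 4, 7))  # ridge regression across all voxels
    encoder.fit(X_tr, y_tr)
    # Per-voxel R^2 on held-out images, summarised by its median
    r2 = r2_score(y_te, encoder.predict(X_te), multioutput="raw_values")
    print(f"{name}: median held-out R^2 = {np.median(r2):.3f}")
```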

Significance of the study

The results demonstrated that brain prediction performance was consistently higher for CLIP than for the other models. CLIP-based encoding models predicted high-level visual representations in the human brain better than single-modality models pre-trained on smaller, less diverse datasets.

ResNet50 trained with CLIP explained up to 79% of the variance in individual voxel responses on held-out test data, significantly more than models trained on text alone (BERT) or on image-label pairs (an ImageNet-trained ResNet). A comparative evaluation of different model backbones confirmed that network architecture played no significant role in the improved performance of the CLIP-based models.
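For reference, "variance explained" here refers to the standard per-voxel R² measure on held-out data; a minimal worked example of the computation (with made-up numbers, not the study's data) is given below.

```python
import numpy as np

def variance_explained(y_true, y_pred):
    """Fraction of variance in a voxel's held-out responses captured by the prediction."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up responses for a single voxel across four held-out images;
# a value of 0.79 would correspond to the reported maximum.
measured = np.array([0.8, 1.6, 2.9, 4.1])
predicted = np.array([1.0, 1.5, 3.0, 3.9])
print(variance_explained(measured, predicted))
```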

However, comparisons across models that controlled for data diversity and dataset size demonstrated that natural language feedback and the data diversity afforded by larger datasets were the important factors in explaining neural responses in high-level visual brain regions.

Beyond a certain training-set size, the improvements were attributed primarily to data diversity and to the joint image-caption training that such data enables. The benefit of language feedback was observed even when dataset factors were controlled.

Moreover, visual brain responses could also be predicted successfully from image captions alone, indicating that CLIP models can bridge vision and natural language. Principal component analysis (PCA) and visualizations of the model embeddings showed that the models capture both fine-grained and global semantic dimensions represented within the human visual cortex.
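As an illustration of the kind of analysis involved, the sketch below applies PCA to a set of model embeddings. The random array stands in for the actual CLIP representations of the NSD scenes, and the number of components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Random stand-in for CLIP embeddings of the viewed scenes (one row per image)
embeddings = rng.standard_normal((1000, 512))

pca = PCA(n_components=10)
scores = pca.fit_transform(embeddings)  # (n_images, 10) component scores per image

# Leading components can then be related to image content or projected onto
# cortex to look for global versus fine-grained semantic dimensions.
print(pca.explained_variance_ratio_[:5])
```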


Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.

