In a recent preprint submitted to arXiv*, researchers from Ohio State University and the University of Texas, USA, aimed to bridge the gap between artificial and human vision and to pave the way for more brain-like artificial intelligence systems.
They developed a novel vision model called ReAlnet that is aligned with human brain activity via non-invasive electroencephalography (EEG) recordings and demonstrates significantly higher similarity to human brain representations than existing models. ReAlnet also exhibited increased adversarial robustness and hierarchical individual variability across layers, reflecting the complexity and adaptability of human visual processing.
Background
An object detection model is a computer vision system that identifies and classifies objects within an image or video. Popular deep learning-based models include the Faster Region-based Convolutional Neural Network (Faster R-CNN), You Only Look Once (YOLO), and the Single Shot MultiBox Detector (SSD). These models are crucial for applications such as autonomous vehicles, surveillance, and image analysis.
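For illustration, a pretrained detector of this kind can be run in a few lines with PyTorch's torchvision library. The snippet below is a generic usage sketch, unrelated to the paper's own code:

```python
import torch
import torchvision

# Illustrative only: load a pretrained Faster R-CNN detector from torchvision.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)  # dummy RGB image with values in [0, 1]
with torch.no_grad():
    pred = model([image])[0]     # dict with per-detection boxes, labels, scores
print(pred["boxes"].shape, pred["scores"][:5])
```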
Artificial intelligence has made substantial progress, yet existing object recognition models still fall short of replicating the complex mechanisms of visual information processing observed in the human brain. Recent research has highlighted the promise of using neural data to steer models toward brain-like processing; however, such work has relied heavily on invasive neural recordings from non-human subjects.
About the Research
In the present paper, the authors used CORnet-S (core object recognition network with a simple architecture), a state-of-the-art vision model, as the foundational architecture for ReAlnet. CORnet-S is a recurrent convolutional neural network that mimics the hierarchical structure of the brain's ventral visual stream, which is responsible for object recognition.
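To give a rough intuition for the recurrence, the toy PyTorch block below applies the same convolution several times so that later passes refine earlier ones. It is a simplified stand-in for illustration, not the actual CORnet-S implementation (which, for instance, uses separate normalization layers per time step):

```python
import torch
import torch.nn as nn

class RecurrentConvBlock(nn.Module):
    """Toy recurrent convolutional block in the spirit of CORnet-S:
    the same convolution is applied repeatedly, so later passes can
    refine earlier ones. A simplified stand-in, not the real CORnet-S."""
    def __init__(self, channels: int, n_steps: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)
        self.n_steps = n_steps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = x
        for _ in range(self.n_steps):
            # Recurrent refinement with a skip back to the block input;
            # reusing one BatchNorm across steps is a simplification.
            out = torch.relu(self.norm(self.conv(out)) + x)
        return out
```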
The study added an EEG generation module to CORnet-S, consisting of a series of encoders that transform the latent features from each of the model's visual layers into predicted EEG signals. EEG signals are human neural data that measure the brain's electrical activity through electrodes attached to the scalp and can reflect the brain's response to different visual stimuli, such as images of objects. The authors used these signals to align their vision model with human brain representations and thereby achieve more human brain-like vision.
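A minimal sketch of one such encoder might look as follows. The pooling-plus-linear design and the electrode and time-point counts are illustrative assumptions; the paper's exact encoder architecture may differ:

```python
import torch
import torch.nn as nn

class EEGEncoder(nn.Module):
    """Maps one visual layer's latent features to a predicted EEG signal.

    Hypothetical sketch: the channel and time dimensions are placeholders,
    and the paper's actual encoder design may differ.
    """
    def __init__(self, in_channels: int, n_electrodes: int = 17, n_timepoints: int = 100):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                  # collapse spatial dims
        self.proj = nn.Linear(in_channels, n_electrodes * n_timepoints)
        self.n_electrodes, self.n_timepoints = n_electrodes, n_timepoints

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.pool(feats).flatten(1)                      # (batch, in_channels)
        x = self.proj(x)                                     # (batch, electrodes * time)
        return x.view(-1, self.n_electrodes, self.n_timepoints)
```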
The model was trained to minimize both a classification loss on ImageNet labels and a generation loss between the predicted and real EEG signals, using a large and rich EEG dataset (THINGS EEG2) recorded while participants viewed images of objects from different categories. The researchers also used another EEG dataset, THINGS EEG1, and a functional magnetic resonance imaging (fMRI) dataset, the Shen fMRI dataset, to evaluate the model's similarity to human brain representations across different modalities, subjects, and images.
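Conceptually, the joint objective can be pictured as a weighted sum of the two losses, as in the sketch below. Cross-entropy, mean-squared error, and the trade-off weight alpha are illustrative choices, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, pred_eeg, real_eeg, alpha: float = 1.0):
    """Weighted sum of classification and EEG-generation losses.

    alpha is a hypothetical trade-off weight; the paper may balance the
    two terms differently (e.g., with similarity-based generation losses).
    """
    cls_loss = F.cross_entropy(logits, labels)   # ImageNet classification
    gen_loss = F.mse_loss(pred_eeg, real_eeg)    # predicted vs. recorded EEG
    return cls_loss + alpha * gen_loss
```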
The THINGS EEG2 dataset contains EEG responses from 10 subjects to 22,248 images spanning 1,854 object concepts. Similarly, THINGS EEG1 comprises responses from 50 subjects to 4,320 images from 720 object concepts. The Shen fMRI dataset contains fMRI responses from three subjects to 40 images from different categories.
The study evaluated ReAlnet on several aspects, including its similarity to human EEG and fMRI responses, its individual variability across subjects and layers, and its adversarial robustness against white-box attacks. The researchers compared ReAlnet with CORnet-S and other baseline models, such as the 101-layer residual network (ResNet-101) and contrastive language-image pre-training (CLIP).
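White-box attacks assume the attacker has full access to the model's gradients; the fast gradient sign method (FGSM) is a standard one-step example. The sketch below estimates robust accuracy under FGSM; the attack and the epsilon budget are illustrative choices, and the paper's specific attack protocol may differ:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, epsilon: float = 8 / 255):
    """One-step white-box FGSM: perturb inputs along the loss gradient's sign."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    model.zero_grad(set_to_none=True)
    loss.backward()
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()  # assumes inputs live in [0, 1]

def robust_accuracy(model, images, labels, epsilon: float = 8 / 255):
    """Fraction of adversarial examples the model still classifies correctly."""
    adv = fgsm_attack(model, images, labels, epsilon)
    preds = model(adv).argmax(dim=1)
    return (preds == labels).float().mean().item()
```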
Research Findings
The results showed that ReAlnet achieved significantly higher similarity to human EEG neural dynamics than CORnet-S across all four visual layers and also outperformed the other baseline models, including ResNet-101 and CLIP. This similarity was consistent across different EEG and fMRI datasets, indicating that ReAlnet learned general, cross-modal brain representation patterns.
The model also showed higher similarity to human fMRI activity across different brain regions, even though it was never trained on fMRI data, further suggesting that ReAlnet learned general neural representations of the human brain. It exhibited hierarchical individual variability across layers, reflecting the increasing complexity and diversity of neural representations in the human brain. Furthermore, it demonstrated greater adversarial robustness than CORnet-S, indicating that aligning with human neural representations can improve a model's stability and generalization.
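Model-to-brain similarity of this kind is commonly quantified with representational similarity analysis (RSA), which correlates the pairwise stimulus dissimilarities computed from model features with those computed from brain responses. The article does not state the authors' exact metric, so the generic sketch below is an assumption:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_similarity(model_feats: np.ndarray, brain_resps: np.ndarray) -> float:
    """Spearman correlation between model and brain representational
    dissimilarity matrices (rows are stimuli; columns are features/channels)."""
    model_rdm = pdist(model_feats, metric="correlation")  # pairwise dissimilarities
    brain_rdm = pdist(brain_resps, metric="correlation")
    rho, _ = spearmanr(model_rdm, brain_rdm)
    return rho
```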
The developed techniques have implications for both computer vision and cognitive neuroscience. For computer vision, ReAlnet offers a novel and effective approach to enhancing the resemblance between vision models and the human brain, which can improve the robustness and generalization of such models and enable more brain-like artificial intelligence systems. For cognitive neuroscience, the method can serve as a tool to explore the mechanisms of human visual processing and to test hypotheses and predictions about the brain's representational patterns.
Conclusion
In summary, the study presents an effective, efficient, and adaptable framework for human neural representational alignment, along with the resulting human brain-like model, ReAlnet. The model not only aligns closely with human EEG and fMRI responses but also exhibits hierarchical individual variability and increased adversarial robustness, mirroring human visual processing. In doing so, the research helps bridge the gap between human and artificial vision.
The researchers suggest that the technique can be extended to other neural modalities, such as fMRI and magnetoencephalography (MEG), and to other tasks, including natural language and auditory processing, using unsupervised or self-supervised models. They also acknowledged challenges and limitations, such as the small size of the neural datasets and the lack of shared labels among the different datasets.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.