In a paper published by Meta AI, researchers decoded speech from brain activity by introducing a model trained with contrastive learning that decodes self-supervised representations of perceived speech from non-invasive brain recordings of healthy individuals.
Trained on data curated and integrated from four public datasets, the model exhibited impressive performance, identifying speech segments from magnetoencephalography (MEG) signals with an accuracy of up to 41%. This result demonstrates the potential to decode language from brain activity non-invasively, without the risks associated with brain surgery.
Background
Traumatic brain injuries, strokes, and neurodegenerative diseases often result in patients losing speech and communication abilities. Brain-computer interfaces (BCIs) offer hope for detecting and restoring communication in these individuals. In recent years, research teams have successfully used BCIs to decode phonemes, speech sounds, hand gestures, and articulatory movements from electrodes implanted in or over the cortex.
For instance, Willett et al. achieved a decoding rate of 90 characters per minute with 94% accuracy in a patient with spinal cord injury, using recordings from the motor cortex during attempted handwriting. Similarly, Moses et al. achieved a decoding rate of 15.2 words per minute with 74.4% accuracy in a patient with anarthria implanted in the sensorimotor cortex. However, these invasive methods require brain surgery and can be challenging to maintain.
Proposed Method
Problem Formalization: This research aims to decode speech from high-dimensional brain signal time series collected through non-invasive techniques such as magnetoencephalography (MEG) or electroencephalography (EEG). The brain recordings were obtained while healthy volunteers listened to spoken sentences in their native language. Because it is poorly understood how the brain represents spoken words, the decoders are trained in a supervised manner to predict a latent representation of speech that is likely to be relevant to the brain. For example, this latent representation could be the Mel spectrogram, a commonly employed target for neural decoding.
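To make this target concrete, the snippet below sketches how a Mel spectrogram could be computed with torchaudio; the sampling rate, number of Mel bands, and FFT parameters are illustrative assumptions, not values taken from the paper.

```python
import torch
import torchaudio

# Hypothetical 3-second speech clip at 16 kHz (random noise stands in for real audio).
waveform = torch.randn(1, 48000)

# Mel spectrogram as a candidate decoding target; all parameters are assumptions.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=160,
    n_mels=120,
)
mel = mel_transform(waveform)          # shape: (1, 120, n_frames)
log_mel = torch.log(mel + 1e-6)        # log compression, as commonly used for neural targets
print(log_mel.shape)
```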
Model Overview: The proposed model comprises a brain module and a speech module. The brain module takes raw M/EEG time series as input and applies a spatial attention layer that remaps the sensors onto 270 channels based on their physical locations. A subject-specific 1x1 convolution then accounts for inter-subject variability, and a stack of convolutional blocks produces a latent representation of the brain activity. This representation is aligned with a latent representation of the corresponding speech using a contrastive loss.
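The sketch below illustrates this kind of brain module in PyTorch under simplifying assumptions: the spatial attention layer is replaced by a plain learned channel remapping, and all layer sizes and depths are placeholders rather than the architecture reported in the paper.

```python
import torch
import torch.nn as nn

class BrainModule(nn.Module):
    """Simplified sketch: channel remapping, a per-subject 1x1 convolution,
    and a stack of temporal convolution blocks. Sizes are illustrative."""

    def __init__(self, n_sensors=273, n_subjects=100, n_channels=270, latent_dim=1024):
        super().__init__()
        # Stand-in for the paper's spatial attention: a learned remapping of the
        # MEG/EEG sensors onto 270 virtual channels.
        self.spatial = nn.Conv1d(n_sensors, n_channels, kernel_size=1)
        # One 1x1 convolution per subject to absorb inter-subject variability.
        self.subject_layers = nn.ModuleList(
            [nn.Conv1d(n_channels, n_channels, kernel_size=1) for _ in range(n_subjects)]
        )
        # Stack of temporal convolution blocks (depth and width are assumptions).
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv1d(n_channels, n_channels, kernel_size=3, padding=1),
                nn.GELU(),
            ) for _ in range(5)
        ])
        self.head = nn.Conv1d(n_channels, latent_dim, kernel_size=1)

    def forward(self, meg, subject_id):
        # meg: (batch, n_sensors, time); assumes one subject per batch for simplicity.
        x = self.spatial(meg)
        x = self.subject_layers[subject_id](x)
        x = self.blocks(x)
        return self.head(x)            # (batch, latent_dim, time)
```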
Speech Module: The speech module can take two forms. A "Deep Mel" module learns the speech representations jointly with the M/EEG representations in an end-to-end manner. Alternatively, the model can rely on representations from an independent self-supervised speech model, wav2vec 2.0, which was pre-trained on a large audio corpus and has been shown to encode a rich set of linguistic features. Because the pre-trained wav2vec 2.0 representations prove more effective in practice, this research focuses on leveraging them for the decoding task.
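As an illustration of the speech-side targets, the snippet below extracts frame-level representations with a publicly available wav2vec 2.0 checkpoint via Hugging Face Transformers; the specific checkpoint ("facebook/wav2vec2-base") and the random audio are assumptions for demonstration, not the paper's exact setup.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Publicly available wav2vec 2.0 checkpoint, used here only for illustration.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

# Hypothetical 3-second clip at 16 kHz; replace with the audio the subjects actually heard.
waveform = torch.randn(48000)

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Frame-level speech representations the brain module is trained to match.
speech_latents = outputs.last_hidden_state   # shape: (1, n_frames, hidden_dim)
print(speech_latents.shape)
```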
In essence, this research aims to develop a model that can effectively decode speech from non-invasive brain recordings by aligning brain activity with representations of speech, using a contrastive loss for training and benefiting from the pre-trained speech model "wav2vec 2.0" to enhance the decoding process. The goal is to improve the understanding of how the brain processes and represents speech, opening the door to potential applications in communication for patients with speech-related conditions.
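A minimal sketch of such a CLIP-style contrastive objective is shown below, assuming both modules output latents of shape (batch, dim, time) in a common dimension (wav2vec 2.0 outputs would need transposing and projecting first); the mean pooling over time and the temperature value are illustrative choices, not the paper's.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(brain_latents, speech_latents, temperature=0.1):
    """CLIP-style contrastive loss sketch: each brain segment in the batch should
    match its own speech segment and none of the others."""
    # Pool over time so each segment becomes a single normalized vector.
    z_brain = F.normalize(brain_latents.mean(dim=-1), dim=-1)    # (batch, dim)
    z_speech = F.normalize(speech_latents.mean(dim=-1), dim=-1)  # (batch, dim)

    # Similarity of every brain segment to every speech segment in the batch.
    logits = z_brain @ z_speech.t() / temperature                # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```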
Experimental Results
In this study, the researchers presented the results of their investigation, which primarily revolved around accurately decoding speech from non-invasive brain recordings, specifically M/EEG recordings. Their model demonstrated impressive performance, predicting the correct speech segment out of more than 1,000 possibilities with an average top-10 accuracy of up to 70.7% across MEG subjects, and a top-1 accuracy of up to 41.3%. The study also compared MEG and EEG datasets, revealing a substantial gap in decoding performance, with MEG consistently outperforming EEG. Attempts to standardize the data did not close this gap, suggesting that it likely stems from the recording device itself.
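For reference, a segment-retrieval metric of this kind can be computed along the following lines, assuming one pooled vector per brain segment and per candidate speech segment, with matching pairs on the diagonal; this is a generic top-k retrieval sketch, not the paper's evaluation code.

```python
import torch

def top_k_accuracy(brain_vecs, speech_vecs, k=10):
    """For each brain segment, rank all candidate speech segments by similarity
    and check whether the true segment appears among the top k."""
    sims = brain_vecs @ speech_vecs.t()                   # (n_segments, n_segments)
    topk = sims.topk(k, dim=-1).indices                   # (n_segments, k)
    targets = torch.arange(sims.size(0)).unsqueeze(-1)    # true match lies on the diagonal
    return (topk == targets).any(dim=-1).float().mean().item()
```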
The research addressed the issue of extracting meaningful signals from noisy brain recordings, demonstrating that their end-to-end architecture required minimal preprocessing of M/EEG data. Additionally, their findings highlighted the importance of the subject-specific layer in improving decoding performance. The results showed that increasing the number of subjects used to train the model improved decoding performance.
Another critical aspect of their study focused on determining the most suitable features for decoding speech representations from brain signals. They employed a speech module pre-trained on a vast corpus of speech sounds, which yielded high-level features for effective decoding. This contrastive learning approach outperformed supervised decoding models, such as those targeting the Mel spectrogram, and underscored the importance of targeting latent representations of speech. Their analyses suggested that the decoded representations primarily capture lexical and contextual features, similar to those found in word embeddings and language models.
Conclusion
To sum up, this research showcases significant progress in decoding speech from non-invasive brain recordings, primarily focusing on M/EEG data. The developed model achieved impressive accuracy in predicting speech segments, emphasizing the advantages of an end-to-end architecture and the importance of subject-specific layers. By pre-training a speech module on extensive speech sound data and employing contrastive learning, the study revealed the significance of latent representations for decoding high-level features. The findings open doors to promising applications, particularly for interpreting intended communication, while emphasizing the need for device-specific considerations.