A study published in the journal Information Sciences introduces a novel framework for speech emotion recognition based on dual-channel spectrograms and optimized deep features. The proposed methodology leverages complementary spectrograms and deep neural networks to accurately recognize emotions from speech samples.
Speech emotion recognition has diverse applications in human-computer interaction, healthcare, education, automotive systems, and beyond, and identifying emotional states from vocal cues remains an active area of research. Most existing methods use the Mel spectrogram, which emphasizes low-frequency components. This study argues that high frequencies also carry valuable affective information.
The authors propose using two spectrograms - the Mel spectrogram for low frequencies and a new VTMel (VMD-Teager-Mel) spectrogram for high frequencies. Deep learning is employed to extract optimized features from both spectrograms. The dual-channel architecture aims to achieve highly robust emotion recognition by fusing complementary information from the two spectrograms.
Key Aspects of the Study
The study introduced a VTMel spectrogram incorporating variational mode decomposition (VMD) and Teager energy to emphasize high-frequency components related to emotions. VMD adaptively decomposes the speech signal into sub-bands, capturing fine modulation-spectrum detail, while Teager energy enhances the contrast between high and low frequencies in the spectrogram. The VTMel spectrogram thus complements the Mel spectrogram by focusing on high-frequency affective components.
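To make this construction concrete, the sketch below outlines a VTMel-style spectrogram in Python. It is not the authors' implementation: as a simplification, the adaptive VMD decomposition is replaced by a fixed bank of Butterworth band-pass filters, the discrete Teager energy operator is applied to each sub-band, and the enhanced signal is projected onto a Mel filterbank with librosa. The band edges and all other parameters are illustrative assumptions.

```python
# Illustrative sketch of a VTMel-style spectrogram (not the authors' code).
# The adaptive VMD step is approximated by fixed Butterworth band-pass filters;
# Teager energy and the Mel projection follow the idea described above.
import numpy as np
import librosa
from scipy.signal import butter, sosfiltfilt

def teager_energy(x):
    """Discrete Teager energy operator: psi[x](n) = x(n)^2 - x(n-1) * x(n+1)."""
    te = np.empty_like(x)
    te[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    te[0], te[-1] = te[1], te[-2]  # pad the boundary samples
    return np.abs(te)

def vtmel_like_spectrogram(y, sr, n_mels=64, n_fft=1024, hop_length=256):
    # Stand-in for VMD: split the signal into a few fixed sub-bands.
    edges = [50, 1000, 2000, 4000, min(7000, sr / 2 - 100)]
    enhanced = np.zeros(len(y))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, y)
        enhanced += teager_energy(band)  # emphasize per-band energy, sharpening high-frequency content
    # Project the Teager-enhanced signal onto a Mel filterbank.
    return librosa.feature.melspectrogram(
        y=enhanced, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )

# Placeholder signal standing in for a speech utterance.
sr = 16000
y = np.random.default_rng(0).standard_normal(2 * sr)
print(vtmel_like_spectrogram(y, sr).shape)  # (n_mels, n_frames)
```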
A convolutional neural network (CNN) extracted preliminary features from the Mel and VTMel spectrograms, and a deep restricted Boltzmann machine (DBM) optimized these features by reducing redundancy and dimensionality. Together, the CNN-DBM network provided robust deep features from each spectrogram.
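As a rough illustration of this pipeline, the sketch below pairs a small Keras CNN (untrained here, fed with random placeholder spectrograms) with a single scikit-learn restricted Boltzmann machine standing in for the paper's deeper DBM stage. The architecture, input shapes, and hyperparameters are assumptions rather than the authors' configuration.

```python
# Hedged sketch of a CNN -> RBM feature pipeline (illustrative only).
import numpy as np
import tensorflow as tf
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

def build_cnn_extractor(input_shape=(64, 128, 1), feature_dim=256):
    """Small CNN mapping a (mel bins x frames x 1) spectrogram to a feature vector."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=input_shape),
        tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(2),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(feature_dim, activation="relu"),
    ])

# Preliminary CNN features for a batch of spectrograms (random placeholders;
# in practice the CNN would first be trained on emotion labels).
cnn = build_cnn_extractor()
spectrograms = np.random.rand(32, 64, 128, 1).astype("float32")
prelim = cnn.predict(spectrograms, verbose=0)  # shape (32, 256)

# Simplified "feature optimization": rescale to [0, 1] and compress with one RBM
# as a stand-in for the paper's deeper Boltzmann-machine stage.
scaled = MinMaxScaler().fit_transform(prelim)
rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
optimized = rbm.fit_transform(scaled)  # shape (32, 64): lower-dimensional features
print(optimized.shape)
```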
The optimized Mel and VTMel features were concatenated and classified using support vector machines (SVM) and long short-term memory (LSTM) networks. The dual-channel structure leveraged complementary information from both spectrograms for accurate emotion recognition.
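A minimal sketch of the fusion and classification step is shown below, covering only the SVM branch on random placeholder feature arrays; the paper's LSTM classifier would instead operate on sequences of frame-level features. Feature dimensions and SVM hyperparameters are assumptions.

```python
# Minimal sketch of dual-channel fusion followed by SVM classification.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
mel_feats = rng.random((300, 64))      # optimized Mel-channel features (placeholder)
vtmel_feats = rng.random((300, 64))    # optimized VTMel-channel features (placeholder)
labels = rng.integers(0, 7, size=300)  # e.g. 7 emotion classes, as in EMO-DB

# Dual-channel fusion: concatenate the two feature vectors per utterance.
fused = np.concatenate([mel_feats, vtmel_feats], axis=1)  # shape (300, 128)

X_train, X_test, y_train, y_test = train_test_split(
    fused, labels, test_size=0.2, stratify=labels, random_state=0
)
clf = SVC(kernel="rbf", C=10.0, gamma="scale").fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Simple concatenation keeps both channels visible to the classifier; on real features, the reported improvements stem from the VTMel channel contributing high-frequency cues that the Mel channel under-represents.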
Experiments and Results
The proposed method was evaluated on three benchmark datasets - EMO-DB (German), SAVEE (English), and RAVDESS (English) - which together cover more than 20 emotion categories across nearly 3,000 speech samples. Extensive experiments were performed under speaker-dependent, speaker-independent, and gender-dependent scenarios.
The VTMel spectrogram provided significant gains over the Mel spectrogram for recognizing low-arousal emotions such as sadness and disgust, and fusing both spectrograms improved accuracy consistently across all emotions. The proposed CNN-DBM architecture also outperformed CNN features alone for spectrogram representation learning.
The optimized dual-channel features achieved the best performance of 96.27% weighted accuracy on EMO-DB using LSTM networks, and comparable emotion recognition results were obtained on the other datasets and experimental settings. The approach also surpassed 14 recent methods from the literature across diverse evaluation criteria.
Implications and Conclusions
This study makes three main contributions to speech emotion recognition research: a new VTMel spectrogram, an optimized CNN-DBM feature extractor, and a dual-channel fusion framework.
The insights provided in this work can inform the design of robust affective computing systems for diverse human-machine interaction applications. The proposed techniques address a key limitation of Mel-spectrogram-based methods by supplying complementary high-frequency information.
Conclusion and Future Outlook
The study proposed a novel VTMel spectrogram to complement the Mel spectrogram by concentrating on high-frequency affective speech components. Deep CNN-DBM networks were leveraged to extract optimized, low-redundancy features from the Mel and VTMel spectrograms, and a dual-channel architecture fused the complementary information from both for robust emotion recognition. A significant increase in performance was demonstrated on three benchmark datasets under diverse experimental scenarios, surpassing recent state-of-the-art methods and advancing beyond standard Mel-spectrogram-based emotion recognition.
The successful demonstration of optimized spectrogram-based emotion recognition opens up promising research avenues. Evaluating the techniques on more naturalistic and noisy speech would better establish their real-world viability, and new spectrogram representations could be developed to capture further affect-salient acoustic information. Multi-modal emotion recognition combining speech with visual, textual, or physiological signals offers additional possibilities, as do novel network architectures and self-supervised training strategies on the deep learning side, although human-level emotion recognition from voice remains an open challenge. Overall, the paper presents promising techniques for encoding emotion-specific low- and high-frequency details, and it provides a solid foundation to build upon using complementary spectrograms and deep feature learning.