Enhancing Speech Emotion Recognition: A Dual-Channel Spectrogram Approach

A study published in the journal Information Sciences introduces a novel framework for speech emotion recognition based on dual-channel spectrograms and optimized deep features. The proposed methodology leverages complementary spectrograms and deep neural networks to accurately recognize emotions from speech samples.

Study: Enhancing Speech Emotion Recognition: A Dual-Channel Spectrogram Approach. Image credit: metamorworks/Shutterstock

Speech emotion recognition has diverse applications in human-computer interaction, healthcare, education, automotive systems, and beyond, and identifying emotional states from vocal cues remains an active area of research. Most existing methods rely on the Mel spectrogram, which emphasizes low-frequency components. This study argues that high frequencies also carry valuable affective information.
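To make the low-frequency bias concrete, the short Python check below prints the spacing between adjacent Mel filter center frequencies, which is narrow at low frequencies and wide at high ones. This is only a sketch using librosa; the filter count and frequency range are illustrative choices, not values from the paper.

```python
# Why the Mel scale favors low frequencies: the spacing between adjacent
# Mel filter center frequencies grows rapidly above ~1 kHz, so
# high-frequency detail is averaged into wide bands.
import numpy as np
import librosa

centers = librosa.mel_frequencies(n_mels=40, fmin=0.0, fmax=8000.0)
spacing = np.diff(centers)
print(f"band spacing near 500 Hz:  ~{spacing[np.searchsorted(centers, 500)]:.0f} Hz")
print(f"band spacing near 6000 Hz: ~{spacing[np.searchsorted(centers, 6000)]:.0f} Hz")
```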

The authors propose using two spectrograms: the Mel spectrogram for low frequencies and a new VTMel (VMD-Teager-Mel) spectrogram for high frequencies. Deep learning is employed to extract optimized features from both spectrograms, and the dual-channel architecture aims to achieve robust emotion recognition by fusing their complementary information.

Key Aspects of the Study

The study introduced a VTMel spectrogram that incorporates variational mode decomposition (VMD) and the Teager energy operator to emphasize the high-frequency components related to emotion. VMD adaptively decomposes the speech signal into sub-bands, capturing fine modulation-spectrum details, while Teager energy enhances the contrast between high and low frequencies in the spectrogram. The VTMel spectrogram thus complements the Mel spectrogram by focusing on high-frequency affective components.
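The sketch below gives a rough flavor of these ingredients: VMD sub-bands, the Teager energy operator, and Mel filtering. It assumes the third-party vmdpy package; the parameter values and the choice to enhance only the upper modes are illustrative assumptions, not the authors' implementation.

```python
# A VTMel-style pipeline sketch: VMD sub-bands + Teager energy + Mel filtering.
# NOTE: vmdpy is a third-party VMD package; all parameters here are illustrative.
import numpy as np
import librosa
from vmdpy import VMD

def teager_energy(x):
    # Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return np.maximum(psi, 0.0)  # clip small negatives before Mel filtering

def vtmel_spectrogram(y, sr, n_modes=4, n_mels=64):
    y = y[: len(y) // 2 * 2]  # vmdpy expects an even-length signal
    # VMD(signal, alpha, tau, K, DC, init, tol) -> (modes, modes_hat, omega)
    modes, _, _ = VMD(y, 2000, 0.0, n_modes, 0, 1, 1e-7)
    # Enhance the higher-frequency modes with Teager energy; recombining only
    # the upper half of the modes is an assumption, not the paper's recipe.
    enhanced = np.sum([teager_energy(m) for m in modes[n_modes // 2:]], axis=0)
    return librosa.feature.melspectrogram(y=enhanced, sr=sr, n_mels=n_mels)

y, sr = librosa.load(librosa.ex("trumpet"))  # substitute any mono speech clip
print(vtmel_spectrogram(y, sr).shape)        # (n_mels, n_frames)
```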

A convolutional neural network (CNN) extracted preliminary features from the Mel and VTMel spectrograms, and a deep restricted Boltzmann machine (DBM) optimized these features by reducing redundancy and dimensionality. Together, the CNN-DBM network provided robust deep features from each spectrogram.
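The following sketch conveys the shape of this feature path: a small PyTorch CNN produces preliminary features, and a single-layer BernoulliRBM from scikit-learn stands in for the paper's deeper DBM. All layer sizes, feature dimensions, and the synthetic input batch are illustrative assumptions.

```python
# CNN -> RBM feature path: a convolutional encoder yields preliminary
# spectrogram features; an RBM (stand-in for the paper's DBM) compresses them.
import torch
import torch.nn as nn
from sklearn.neural_network import BernoulliRBM

cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((4, 4)),  # fixed-size feature map regardless of clip length
    nn.Flatten(),                  # -> 32 * 4 * 4 = 512 preliminary features
)

specs = torch.rand(100, 1, 64, 128)  # placeholder batch of Mel/VTMel spectrograms
with torch.no_grad():
    feats = cnn(specs).numpy()
# Scale to [0, 1]: BernoulliRBM models binary/probability-valued inputs.
feats = (feats - feats.min()) / (feats.max() - feats.min() + 1e-9)

rbm = BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)
compressed = rbm.fit_transform(feats)  # redundancy-reduced deep features
print(compressed.shape)                # (100, 128)
```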

The optimized Mel and VTMel features were concatenated and classified using support vector machines (SVM) and long short-term memory (LSTM) networks. The dual-channel structure leveraged complementary information from both spectrograms for accurate emotion recognition.
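A minimal sketch of the fusion-and-classification step follows, using scikit-learn's SVM. The random placeholder features stand in for the CNN-DBM outputs of each branch; the feature dimensions and class count are assumptions (7 matches the EMO-DB emotion set).

```python
# Dual-channel fusion: concatenate Mel- and VTMel-branch features per
# utterance, then classify with an SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
mel_feats = rng.random((100, 128))     # placeholder Mel-branch features
vtmel_feats = rng.random((100, 128))   # placeholder VTMel-branch features
labels = rng.integers(0, 7, size=100)  # e.g. the 7 EMO-DB emotion classes

fused = np.concatenate([mel_feats, vtmel_feats], axis=1)  # (100, 256)
clf = SVC(kernel="rbf", C=1.0)
scores = cross_val_score(clf, fused, labels, cv=5)
print(f"5-fold accuracy: {scores.mean():.3f}")
```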

Experiments and Results

The proposed method was evaluated on three benchmark datasets: EMO-DB (German), SAVEE (English), and RAVDESS (English). Together, these datasets cover more than 20 emotion categories across nearly 3,000 speech samples. Extensive experiments were performed under speaker-dependent, speaker-independent, and gender-dependent scenarios.

The VTMel spectrogram provided significant gains over the Mel spectrogram for recognizing low-arousal emotions such as sadness and disgust, and fusing both spectrograms consistently improved accuracy across all emotions. The proposed CNN-DBM architecture also outperformed CNN-only features for spectrogram representation learning.

The optimized dual-channel features achieved the best performance, 96.27% weighted accuracy, on EMO-DB using LSTM networks. Comparable results were obtained on the other datasets and experimental settings, and the approach surpassed 14 recent methods from the literature across diverse evaluation criteria.

Implications and Conclusions

This study makes three contributions to speech emotion recognition research: a new VTMel spectrogram, an optimized CNN-DBM feature extractor, and a dual-channel fusion framework.

The insights provided in this work can inform the design of robust affective computing systems for diverse human-machine interaction applications. The proposed techniques address a key limitation of Mel-spectrogram-based methods by supplying complementary high-frequency information.

Conclusion and Future Outlook

The study proposed a novel VTMel spectrogram that complements the Mel spectrogram by concentrating on high-frequency affective speech components. Deep CNN-DBM networks were leveraged to extract optimized, low-redundancy features from the Mel and VTMel spectrograms, and a dual-channel architecture fused the complementary information from both for robust emotion recognition. Significant performance gains were demonstrated on three benchmark datasets under diverse experimental scenarios, surpassing recent state-of-the-art methods and advancing beyond standard Mel-spectrogram-based emotion recognition.

The successful demonstration of optimized spectrogram-based emotion recognition opens up promising research avenues. Evaluating the techniques on more naturalistic and noisy speech would better establish their real-world viability, and developing new spectrogram representations that capture affect-salient acoustic information is also worthwhile. Multi-modal emotion recognition combining speech with visual, textual, or physiological signals offers further possibilities, as does exploring novel network architectures and self-supervised training strategies, although advancing toward human-level emotion recognition from voice remains an open challenge. The study provides a solid foundation to build upon using complementary spectrograms and deep feature learning.

Overall, the paper presents promising techniques for encoding emotion-specific low- and high-frequency details for automatic emotion recognition from speech.

Journal reference:

Written by

Aryaman Pattnayak

Aryaman Pattnayak is a Tech writer based in Bhubaneswar, India. His academic background is in Computer Science and Engineering. Aryaman is passionate about leveraging technology for innovation and has a keen interest in Artificial Intelligence, Machine Learning, and Data Science.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Pattnayak, Aryaman. (2023, September 10). Enhancing Speech Emotion Recognition: A Dual-Channel Spectrogram Approach. AZoAi. Retrieved on September 18, 2024 from https://www.azoai.com/news/20230910/Enhancing-Speech-Emotion-Recognition-A-Dual-Channel-Spectrogram-Approach.aspx.

  • MLA

    Pattnayak, Aryaman. "Enhancing Speech Emotion Recognition: A Dual-Channel Spectrogram Approach". AZoAi. 18 September 2024. <https://www.azoai.com/news/20230910/Enhancing-Speech-Emotion-Recognition-A-Dual-Channel-Spectrogram-Approach.aspx>.

  • Chicago

    Pattnayak, Aryaman. "Enhancing Speech Emotion Recognition: A Dual-Channel Spectrogram Approach". AZoAi. https://www.azoai.com/news/20230910/Enhancing-Speech-Emotion-Recognition-A-Dual-Channel-Spectrogram-Approach.aspx. (accessed September 18, 2024).

  • Harvard

    Pattnayak, Aryaman. 2023. Enhancing Speech Emotion Recognition: A Dual-Channel Spectrogram Approach. AZoAi, viewed 18 September 2024, https://www.azoai.com/news/20230910/Enhancing-Speech-Emotion-Recognition-A-Dual-Channel-Spectrogram-Approach.aspx.
