Advancing Emotional Speech Recognition Using Deep Learning

In a paper published in the journal Applied Acoustics, researchers discuss the role of human emotions in communication and present a speech emotion recognition (SER) system, which predicts a speaker's emotional tone from audio signals. The system uses spectral and prosodic features, such as Mel-frequency cepstral coefficients (MFCC) and pitch, to identify the speaker's emotional state and even differentiate gender. According to the article, the proposed SER system outperforms the existing system, with an average accuracy of 78% on test data.

Background

Emotions play a vital role in human lives, influencing daily activities and interactions. They are expressed through speech, facial expressions, and gestures. SER involves analyzing vocal behavior, as emotions often manifest in physiological responses that affect how a person speaks. For instance, anger can change breathing, muscle tension, vocal folds, and speech properties. While facial expressions have received more attention, audio emotion expression remains relatively unexplored. However, recent contributions have increased interest in this field, offering diverse applications such as emotion recognition for individuals with disabilities, improved call center interactions, and adaptive e-learning platforms.

Developing SER systems presents challenges, including quantifying emotions and mimicking human behavior. Researchers employ machine learning techniques to create classification-based systems, using features such as pitch, energy, frequency, and modulation spectral features. Feature extraction and selection are crucial to reduce redundancy and improve system performance. Classification methods such as neural networks, Gaussian mixture models, support vector machines, and recurrent neural networks have been proposed.
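To make two of the features named above concrete, here is a minimal sketch of how short-time energy and pitch might be computed per frame. It is not the method from the paper: the function names, the 40 ms frame length, the 50 to 400 Hz search range, and the synthetic 200 Hz tone standing in for voiced speech are all assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(signal, frame_len, hop_len):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

def short_time_energy(frames):
    """Mean squared amplitude of each frame."""
    return np.mean(frames ** 2, axis=1)

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame from the
    strongest autocorrelation peak inside the allowed lag range."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return sample_rate / best_lag

# Synthetic stand-in for a voiced segment (assumption): a 200 Hz tone.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = 0.5 * np.sin(2 * np.pi * 200.0 * t)

frames = frame_signal(signal, frame_len=640, hop_len=320)  # 40 ms frames
energy = short_time_energy(frames)
pitch = np.array([estimate_pitch(f, sample_rate) for f in frames])
```

Real speech would also need a voiced/unvoiced decision before the pitch estimate is trusted, which this sketch leaves out.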

SER has relevance in various domains, including human-computer interaction, Internet of Things, and artificial intelligence applications. For example, speech recognition integrated with emotional cues can enhance self-driving cars' safety features or enable effective call routing in call centers based on caller emotions. SER also finds use in lie-detection systems and humanoids, aiding in human-like interactions.

The article outlines a proposed SER system that uses MFCC for feature extraction and a convolutional neural network (CNN) for classification. Implementation details, results, and possible future enhancements are discussed, highlighting the potential of CNN-based models for SER.

Study: Advancing Emotional Speech Recognition Using Deep Learning. Image credit: TierneyMJ / Shutterstock

Literature review

Speech emotion recognition faces challenges when framed as binary classification, which hinders the accurate identification and interpretation of emotional cues in speech. To overcome this difficulty, Cao et al. developed a ranking SVM strategy that synthesizes information to recognize emotions. They treated the data from each speaker as an individual query and trained SVM algorithms for different emotions, achieving notable improvements in accuracy compared to traditional SVMs.

Arias et al. proposed a shape-based method for recognizing emotional salience using the fundamental frequency (F0). They employed functional data analysis and principal component analysis (PCA) to capture the natural variability of F0 contours, resulting in a binary classification accuracy of 75.8%.
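As a rough illustration of the idea of applying PCA to F0 contours, the sketch below resamples contours of different lengths onto a common grid and extracts their main modes of variation. It is not the procedure of Arias et al.: the synthetic contours, the 50-point grid, and the helper names are assumptions, and plain linear interpolation stands in for a full functional data analysis.

```python
import numpy as np

def resample_contour(f0, n_points=50):
    """Linearly interpolate an F0 contour onto a fixed-length time grid."""
    old_grid = np.linspace(0.0, 1.0, num=len(f0))
    new_grid = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(new_grid, old_grid, f0)

def pca_fit(X, n_components):
    """Return the mean contour, the leading principal axes, and the
    fraction of variance each of those axes explains."""
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    variance = s ** 2
    ratio = variance / variance.sum()
    return mean, vt[:n_components], ratio[:n_components]

# Synthetic contours (assumption): a 180 Hz baseline with a random
# rising or falling slope plus a little jitter, at varying lengths.
rng = np.random.default_rng(0)
contours = []
for _ in range(40):
    n = int(rng.integers(60, 120))
    t = np.linspace(0.0, 1.0, n)
    slope = rng.normal(0.0, 30.0)
    contours.append(180.0 + slope * t + rng.normal(0.0, 2.0, n))

X = np.vstack([resample_contour(c) for c in contours])
mean, axes, ratio = pca_fit(X, n_components=2)
scores = (X - mean) @ axes.T  # low-dimensional description of each contour
```

The rows of `scores` are the kind of compact shape descriptors that a downstream classifier could consume in place of the raw contours.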

Grimm et al. presented a multi-dimensional model using emotion primitives, combining valence, activation, and dominance. They utilized a text-free, image-based method and extracted acoustic features to achieve an overall recognition rate of 83.5%.

Nwe et al. introduced a method for speech emotion classification using discrete hidden Markov models (HMM) and log frequency power coefficients (LFPC). They divided emotions into six categories and achieved accuracy rates of 78% and 96% for different types of emotion identification.

Other approaches include ensemble random forest trees, Gaussian mixture vector autoregressive (GMVAR) models, computational techniques for emotion recognition in vocal social media, fusion-based methods, and modulation spectral features. These methods have advanced the categorization of emotional speech and demonstrated promising results on different datasets.

Existing speech emotion recognition systems have drawbacks such as long pre-processing steps, difficulty handling variable-length audio files, high cost, static nature, and poor performance in real-world scenarios. Future research aims to address these limitations and develop more robust and adaptable systems.

Proposed SER system and its implementation

This study aims to develop an efficient SER system capable of accurately identifying human emotions from audio signals. The system employs a CNN algorithm and MFCC feature extraction to achieve this goal. Previous research primarily focused on lexical analysis to classify emotions into anger, joy, and neutral categories, using the correlation between training and testing data as a determining factor. Another approach involved recognizing segments of angry, happy, and neutral emotions using SVM with feature extraction. The proposed system uses MFCC as the feature and a CNN for classification, balancing computational cost and real-time performance.
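For readers unfamiliar with MFCC, the following is a minimal sketch of the usual computation: pre-emphasis, framing and windowing, power spectrum, mel filterbank, logarithm, and a discrete cosine transform. The article does not give the authors' settings, so every parameter here (512-sample frames, 10 ms hop, 26 filters, 13 coefficients, 0.97 pre-emphasis) is an assumed, commonly used default, and the synthetic tone is only a placeholder input.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge
            fbank[i, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_coeffs):
    """Unnormalised DCT-II along the last axis, keeping the first n_coeffs."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate, n_fft=512, hop_len=160, n_filters=26, n_coeffs=13):
    # Pre-emphasis boosts the high-frequency part of the spectrum.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + (len(emphasized) - n_fft) // hop_len
    idx = np.arange(n_fft)[None, :] + hop_len * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)   # small offset avoids log(0)
    return dct_ii(log_mel, n_coeffs)

# Placeholder input (assumption): one second of a 220 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 220.0 * t)
features = mfcc(signal, sample_rate)       # one row of coefficients per frame
```

The resulting matrix, with time along one axis and coefficients along the other, is the kind of two-dimensional input a CNN classifier can consume.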

The system undergoes stages similar to other machine learning projects, including data collection, pre-processing, combining datasets, feature extraction, model training, and evaluation.
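The article does not describe the network's layers, so the model stage can only be illustrated hypothetically. The sketch below shows a forward pass through a very small CNN over an MFCC matrix: one convolution over time, ReLU, global max pooling, and a dense softmax layer. The layer sizes, the four output classes, and the random untrained weights are all assumptions, and training is omitted entirely.

```python
import numpy as np

def conv1d(x, kernels, bias):
    """Valid 1-D convolution over time (cross-correlation, as in most
    deep learning libraries).
    x: (n_frames, n_features); kernels: (n_out, width, n_features)."""
    n_out, width, _ = kernels.shape
    n_steps = x.shape[0] - width + 1
    out = np.empty((n_steps, n_out))
    for t in range(n_steps):
        window = x[t:t + width]            # (width, n_features)
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return out

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract max for numerical stability
    return e / e.sum()

def forward(x, params):
    h = np.maximum(conv1d(x, params["kernels"], params["conv_bias"]), 0.0)  # ReLU
    pooled = h.max(axis=0)                 # global max pooling over time
    logits = params["dense_w"] @ pooled + params["dense_b"]
    return softmax(logits)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_features, n_filters, width, n_classes = 13, 8, 5, 4
params = {
    "kernels": rng.normal(0.0, 0.1, (n_filters, width, n_features)),
    "conv_bias": np.zeros(n_filters),
    "dense_w": rng.normal(0.0, 0.1, (n_classes, n_filters)),
    "dense_b": np.zeros(n_classes),
}

short_clip = rng.normal(size=(60, n_features))    # stand-ins for MFCC matrices
long_clip = rng.normal(size=(150, n_features))
probs_short = forward(short_clip, params)
probs_long = forward(long_clip, params)
```

Pooling over the time axis gives a fixed-size vector regardless of clip length, which is one common way to deal with variable-length audio.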

Results and conclusions

The performance of the CNN model on the test dataset is satisfactory. Results in the article indicate that the model performs consistently across all emotion classes. A comparison of the proposed SER system, which uses a CNN with MFCC feature extraction, against the existing system shows better or equally accurate results for most emotions, with surprise as the exception.
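A per-emotion comparison of this kind rests on computing accuracy separately for each class. The sketch below shows one simple way to do that. The labels are toy values invented for illustration and are not results from the paper; the function names are likewise hypothetical.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels for illustration only; these are not results from the paper.
y_true = ["angry", "angry", "happy", "happy",
          "neutral", "neutral", "surprise", "surprise"]
y_pred = ["angry", "angry", "happy", "neutral",
          "neutral", "neutral", "happy", "surprise"]

class_scores = per_class_accuracy(y_true, y_pred)
total_score = overall_accuracy(y_true, y_pred)
```

Reporting the per-class values alongside the overall figure makes it visible when a single class, such as surprise in the article's comparison, lags behind the rest.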

Future improvements may involve hybrid architecture-based models, incorporating new speech emotion data and data augmentation techniques to enhance training performance and accuracy.
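The article does not say which augmentation techniques are meant. As a hedged illustration, the sketch below shows three transformations that are commonly applied to speech waveforms: additive noise at a chosen signal-to-noise ratio, a random time shift, and a speed change by resampling. The function names, parameter values, and the synthetic input tone are all assumptions.

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return signal + rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)

def time_shift(signal, max_shift, rng):
    """Circularly shift the signal by a random number of samples."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(signal, shift)

def change_speed(signal, rate):
    """Resample by linear interpolation; rate > 1 shortens the clip.
    Note that this simple approach also raises or lowers the pitch."""
    n_out = int(round(len(signal) / rate))
    old_positions = np.arange(len(signal))
    new_positions = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(new_positions, old_positions, signal)

# Placeholder input (assumption): one second of a 220 Hz tone at 16 kHz.
rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 220.0 * t)

noisy = add_noise(signal, snr_db=20.0, rng=rng)
shifted = time_shift(signal, max_shift=1600, rng=rng)
faster = change_speed(signal, rate=1.25)
```

Each augmented copy keeps the emotion label of the clip it was derived from, so the training set grows without new recordings.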

Journal reference:
Dr. Sampath Lonka

Written by

Dr. Sampath Lonka

Dr. Sampath Lonka is a scientific writer based in Bangalore, India, with a strong academic background in Mathematics and extensive experience in content writing. He has a Ph.D. in Mathematics from the University of Hyderabad and is deeply passionate about teaching, writing, and research. Sampath enjoys teaching Mathematics, Statistics, and AI to both undergraduate and postgraduate students. What sets him apart is his unique approach to teaching Mathematics through programming, making the subject more engaging and practical for students.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Lonka, Sampath. (2023, July 06). Advancing Emotional Speech Recognition Using Deep Learning. AZoAi. Retrieved on January 22, 2025 from https://www.azoai.com/news/20230705/Advancing-Emotional-Speech-Recognition-Using-Deep-Learning.aspx.

  • MLA

    Lonka, Sampath. "Advancing Emotional Speech Recognition Using Deep Learning". AZoAi. 22 January 2025. <https://www.azoai.com/news/20230705/Advancing-Emotional-Speech-Recognition-Using-Deep-Learning.aspx>.

  • Chicago

    Lonka, Sampath. "Advancing Emotional Speech Recognition Using Deep Learning". AZoAi. https://www.azoai.com/news/20230705/Advancing-Emotional-Speech-Recognition-Using-Deep-Learning.aspx. (accessed January 22, 2025).

  • Harvard

    Lonka, Sampath. 2023. Advancing Emotional Speech Recognition Using Deep Learning. AZoAi, viewed 22 January 2025, https://www.azoai.com/news/20230705/Advancing-Emotional-Speech-Recognition-Using-Deep-Learning.aspx.
