In a paper published in the journal Applied Acoustics, researchers discuss the significance of human emotions in communication and introduce speech emotion recognition (SER), which predicts a speaker's emotional tone from audio signals. The SER system utilizes spectral and prosodic features, such as Mel-frequency cepstral coefficients (MFCCs) and pitch, to identify the speaker's emotional state and even differentiate the speaker's gender. The proposed SER system outperforms an existing system, achieving an average accuracy of 78% on test data.
Background
Emotions play a vital role in human lives, influencing daily activities and interactions. They are expressed through speech, facial expressions, and gestures. SER involves analyzing vocal behavior, as emotions often manifest in physiological responses that affect how a person speaks. For instance, anger can alter breathing, muscle tension, vocal-fold behavior, and other speech properties. While facial expressions have received more attention, emotion expression through audio remains relatively unexplored. However, recent contributions have increased interest in the field, offering diverse applications such as emotion recognition for individuals with disabilities, improved call center interactions, and adaptive e-learning platforms.
Developing SER systems presents challenges, including quantifying emotions and mimicking human behavior. Researchers employ machine learning techniques to create classification-based systems, using features such as pitch, energy, frequency, and modulation spectral features. Feature extraction and selection are crucial to reduce redundancy and improve system performance. Classification methods such as neural networks, Gaussian mixture models, support vector machines, and recurrent neural networks have been proposed.
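To make the feature side of such a pipeline concrete, the sketch below extracts the kinds of spectral and prosodic descriptors mentioned above (MFCCs, pitch, and energy) from a single clip. The article does not name any tooling, so the use of librosa, the parameter values, and the summary statistics are all illustrative assumptions.

```python
# Illustrative feature extraction for SER; library choice (librosa) and
# parameters are assumptions, since the article does not specify tooling.
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=13):
    """Return a fixed-length vector of MFCC, pitch, and energy statistics."""
    y, sr = librosa.load(path, sr=sr)

    # Spectral features: mean and standard deviation of each MFCC over time.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # Prosodic features: fundamental-frequency (pitch) statistics via pYIN.
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=500.0, sr=sr)
    f0 = f0[~np.isnan(f0)]
    pitch_stats = np.array([f0.mean(), f0.std()]) if f0.size else np.zeros(2)

    # Short-time energy summary.
    rms = librosa.feature.rms(y=y)[0]
    return np.concatenate([mfcc_stats, pitch_stats, [rms.mean(), rms.std()]])
```

Any of the classifiers listed above, an SVM for instance, could then be trained on these fixed-length vectors.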
SER has relevance in various domains, including human-computer interaction, the Internet of Things, and artificial intelligence applications. For example, speech recognition integrated with emotional cues can enhance the safety features of self-driving cars or enable effective call routing in call centers based on caller emotions. SER also finds use in lie-detection systems and humanoid robots, aiding human-like interactions.
The article outlines a proposed SER system that uses Mel-frequency cepstral coefficients (MFCCs) for feature extraction and a convolutional neural network (CNN) for classification. Implementation details, results, and possibilities for future enhancement are discussed, highlighting the potential of CNN-based models for SER.
Literature review
Conventional binary classification struggles to capture the relative, speaker-dependent nature of emotional expression in speech, hindering accurate identification and interpretation. Cao et al. addressed this difficulty with a ranking SVM strategy that combines per-emotion rankings to recognize emotions: data from each speaker are treated as a separate query, and rankers are trained for each emotion, achieving notable improvements in accuracy compared to traditional SVMs.
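As a rough illustration of the ranking idea (not a reconstruction of Cao et al.'s pipeline), a ranking SVM can be reduced to an ordinary linear SVM trained on pairwise feature differences; the features, labels, and scikit-learn tooling below are placeholder assumptions.

```python
# Pairwise-transform illustration of a ranking SVM: train a linear SVM on
# feature differences so its weight vector orders utterances by emotion
# intensity. All data here is random placeholder data.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 10))             # acoustic feature vectors (placeholder)
level = rng.integers(0, 3, size=40)       # ordinal emotion-intensity labels

pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if level[i] > level[j]:           # i should be ranked above j
            pairs.append(X[i] - X[j]); labels.append(1)
            pairs.append(X[j] - X[i]); labels.append(-1)

ranker = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))
scores = X @ ranker.coef_.ravel()         # higher score = stronger emotion
```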
Arias et al. proposed a shape-based method for recognizing emotional salience in the fundamental frequency (F0) contour. They employed functional data analysis and PCA to capture the natural variability of F0 contours, reaching 75.8% accuracy in binary classification.
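A minimal sketch of the general contour-based idea, assuming librosa for F0 extraction and scikit-learn for PCA (neither tool is attributed to Arias et al. in the article): each utterance's F0 trajectory is resampled to a common length and summarized by a few principal shape components.

```python
# Illustrative F0-contour shape features: resample each voiced pitch contour
# to a fixed length, then keep a few PCA components as classifier inputs.
import numpy as np
import librosa
from sklearn.decomposition import PCA

def f0_contour(path, length=100):
    y, sr = librosa.load(path, sr=None)
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=500.0, sr=sr)
    f0 = f0[~np.isnan(f0)]                       # voiced frames only
    if f0.size == 0:
        return np.zeros(length)
    grid = np.linspace(0.0, 1.0, length)
    return np.interp(grid, np.linspace(0.0, 1.0, f0.size), f0)

# Placeholder file list; in practice one contour per labelled utterance.
contours = np.vstack([f0_contour(p) for p in ["a.wav", "b.wav", "c.wav"]])
shape_features = PCA(n_components=2).fit_transform(contours)
```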
Grimm et al. presented a multi-dimensional model based on emotion primitives, combining valence, activation, and dominance. Using a text-free approach with extracted acoustic features, they achieved an overall recognition rate of 83.5%.
Nwe et al. introduced a system for speech emotion classification using discrete hidden Markov models (HMMs) and log frequency power coefficients (LFPC). They divided emotions into six categories and reported average and best accuracies of 78% and 96%, respectively.
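The sketch below illustrates the per-emotion HMM scheme in which each emotion gets its own model and a test utterance is assigned to the model with the highest likelihood; Gaussian HMMs from hmmlearn over generic frame features stand in for the discrete HMM and LFPC front end of the original work.

```python
# One HMM per emotion; an utterance is assigned to the emotion whose model
# gives the highest log-likelihood. GaussianHMM over generic frame features
# stands in for the discrete HMM + LFPC configuration described in the paper.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_emotion_hmms(sequences_by_emotion, n_states=4):
    """sequences_by_emotion: {emotion: [frame matrix of shape (T_i, n_feats), ...]}"""
    models = {}
    for emotion, seqs in sequences_by_emotion.items():
        X = np.vstack(seqs)                      # stack all frames
        lengths = [len(s) for s in seqs]         # per-sequence frame counts
        models[emotion] = GaussianHMM(n_components=n_states,
                                      covariance_type="diag",
                                      n_iter=20).fit(X, lengths)
    return models

def classify(frames, models):
    """Return the emotion whose HMM scores the utterance highest."""
    return max(models, key=lambda e: models[e].score(frames))
```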
Other approaches include ensembles of random forest trees, Gaussian mixture vector autoregressive (GMVAR) models, computational techniques for emotion recognition in vocal social media, fusion-based methods, and modulation spectral features. These methods have advanced the categorization of emotional speech and demonstrated promising results on different datasets.
Existing speech emotion recognition systems have drawbacks such as long pre-processing steps, difficulty handling variable-length audio files, high cost, static nature, and poor performance in real-world scenarios. Future research aims to address these limitations and develop more robust and adaptable systems.
Proposed SER system and its implementation
This study aims to develop an efficient SER system capable of accurately identifying human emotions from audio signals. The system employs a CNN algorithm with MFCC feature extraction to achieve this goal. Previous research primarily focused on lexical analysis to classify emotions into anger, joy, and neutral categories, using the correlation between training and testing data as a determining factor. Another approach involved recognizing segments of angry, happy, and neutral emotions using an SVM with feature extraction. The proposed system uses MFCCs as features and a CNN for classification, balancing computational load against real-time performance.
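Since the article does not disclose the exact network configuration, the following Keras sketch shows what a small CNN classifier over MFCC inputs might look like; the input dimensions, layer sizes, and number of emotion classes are assumptions.

```python
# Illustrative Keras CNN over MFCC inputs; the article does not give the exact
# architecture, so layer counts, sizes, and class count are assumptions.
import tensorflow as tf

N_MFCC, MAX_FRAMES, N_EMOTIONS = 40, 200, 8      # assumed input/output shapes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(N_MFCC, MAX_FRAMES, 1)),   # MFCC "image"
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(N_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```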
The system follows stages similar to other machine learning projects: data collection, pre-processing, combining datasets, feature extraction, model training, and evaluation.
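A compact sketch of the feature-extraction and data-splitting stages, again assuming librosa and scikit-learn: each clip is converted to an MFCC matrix, padded or trimmed to a fixed number of frames, and then partitioned into training and test sets. File names and labels are placeholders.

```python
# Illustrative preparation of CNN inputs: fixed-size MFCC matrices per clip,
# then a train/test split. Paths and labels are placeholders.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split

def mfcc_matrix(path, n_mfcc=40, max_frames=200):
    y, sr = librosa.load(path, sr=22050)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    m = m[:, :max_frames]                        # trim long clips
    return np.pad(m, ((0, 0), (0, max_frames - m.shape[1])))  # zero-pad short ones

paths = ["angry_001.wav", "happy_001.wav", "sad_001.wav"]  # placeholder files
labels = [0, 1, 2]                                          # integer emotion labels

X = np.stack([mfcc_matrix(p) for p in paths])[..., np.newaxis]  # add channel axis
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
# X_train / y_train would then feed a CNN classifier like the one sketched above.
```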
Results and conclusions
The performance of the CNN model on the test dataset is satisfactory. Results in the article indicate that the model performs consistently across all emotion classes. Comparing the accuracy of the proposed SER system, which uses a CNN with MFCC feature extraction, against the existing system shows better or equally accurate results for most emotions, with surprise being the exception.
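Class-wise consistency of this kind is typically checked with a confusion matrix and per-class accuracy on the held-out set; the scikit-learn sketch below uses placeholder labels and predictions rather than the article's actual results.

```python
# Per-emotion accuracy and confusion matrix on held-out data; labels and
# predictions below are placeholders, not the article's actual results.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

y_true = np.array([0, 1, 2, 2, 1, 0])            # placeholder true labels
y_pred = np.array([0, 1, 2, 1, 1, 0])            # placeholder predictions

cm = confusion_matrix(y_true, y_pred)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)   # recall per emotion
print(per_class_accuracy)
print(classification_report(y_true, y_pred,
                            target_names=["angry", "happy", "surprise"]))
```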
Future improvements may involve hybrid architecture-based models, the incorporation of new speech emotion data, and data augmentation techniques to enhance training performance and accuracy.
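As an example of the kind of data augmentation mentioned, the sketch below produces perturbed copies of a training clip with additive noise, pitch shifting, and time stretching via librosa; the specific transforms and parameters are assumptions, not the authors' choices.

```python
# Simple waveform-level augmentations (noise, pitch shift, time stretch) that
# could expand the training data; parameters are arbitrary assumptions.
import numpy as np
import librosa

def augment(y, sr):
    """Return a few perturbed copies of one training clip."""
    noisy = y + 0.005 * np.random.randn(len(y))                  # additive noise
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # +2 semitones
    stretched = librosa.effects.time_stretch(y, rate=0.9)        # ~10% slower
    return [noisy, shifted, stretched]
```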