In an article recently published in the journal Electronics, researchers investigated the effectiveness of a deep convolutional generative adversarial network (DCGAN)-based data augmentation technique for improving speech emotion recognition.
Background
Emotional speech recognition is becoming increasingly important for various applications in sentiment analysis, customer service, entertainment, healthcare, and smart homes. Early studies on emotion recognition in speech primarily relied on probabilistic models such as Gaussian mixture models (GMMs) and hidden Markov models (HMMs).
However, emotional speech recognition studies using neural networks became prevalent with the advent of deep learning (DL). Although several studies have been performed in this field, accurate emotional speech recognition remains a significant challenge due to the diversity and complexity of emotions, the difficulty of subjective evaluation, and the limited availability of high-quality emotional speech datasets.
For instance, existing emotional speech datasets, such as the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), EmoDB, and the Interactive Emotional Dyadic Motion Capture database (IEMOCAP), are far smaller than machine learning datasets such as ImageNet.
The proposed data augmentation approach
In this study, researchers assessed the feasibility of using DCGANs to augment data from the EmoDB and RAVDESS databases. The speech data from these existing emotional speech datasets was augmented in the form of mel-spectrograms.
Although DCGANs are primarily used to augment image data, researchers in this study evaluated their application to mel-spectrograms, which are time-frequency representations of speech that effectively capture different components of emotion.
Researchers also investigated the effectiveness of a combined emotional speech recognition model of bidirectional long short-term memory (BiLSTM) and convolutional neural networks (CNNs) for accurately identifying emotions from the mel-spectrogram data. Although both emotional speech datasets contain several emotional states, researchers focused primarily on six: sadness, neutral, happiness, fear, disgust, and anger.
In the data preprocessing stage, envelope detection was used to eliminate redundant and silent segments of the speech data obtained from the EmoDB and RAVDESS datasets. Envelope detection in the librosa package can detect the primary variations in an audio signal and eliminate silence efficiently, retaining only the essential audio data.
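As a rough illustration of this preprocessing step, the sketch below uses librosa's amplitude-based splitting to discard silent intervals; the exact function and decibel threshold used in the study are assumptions.

```python
# Silence removal with librosa; the exact function and top_db threshold
# used in the study are assumptions.
import numpy as np
import librosa

def remove_silence(path, sr=16000, top_db=30):
    """Load a speech file and keep only its non-silent intervals."""
    y, sr = librosa.load(path, sr=sr)
    # librosa.effects.split returns (start, end) sample indices of segments
    # whose amplitude envelope is within top_db dB of the signal's peak.
    intervals = librosa.effects.split(y, top_db=top_db)
    # Concatenate the voiced segments, discarding the silent gaps.
    return np.concatenate([y[s:e] for s, e in intervals]), sr
```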
Subsequently, the mel-spectrogram function from the librosa package was used to convert the audio data into a mel-spectrogram. The mel-spectrogram was then converted to a decibel (dB) scale, which compresses its dynamic range and enables more efficient and consistent model training.
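A minimal sketch of this conversion is shown below; the number of mel bands, FFT size, and hop length are illustrative assumptions rather than the study's settings.

```python
# Mel-spectrogram conversion with librosa; n_mels, n_fft, and hop_length
# are illustrative assumptions.
import numpy as np
import librosa

def to_mel_db(y, sr, n_mels=64, n_fft=1024, hop_length=256):
    # Power mel-spectrogram of the trimmed waveform.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_mels=n_mels, n_fft=n_fft, hop_length=hop_length
    )
    # Conversion to a dB scale compresses the dynamic range for training.
    return librosa.power_to_db(mel, ref=np.max)
```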
The DCGAN was trained on the mel-spectrograms acquired from the original speech data, and new mel-spectrograms were generated by the trained generator. Owing to memory limitations, researchers used a mini-batch technique during DL training, and the model layers were constructed with the PyTorch DL framework.
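The mini-batching itself can be handled with a standard PyTorch DataLoader, as in the following sketch; the batch size and tensor shapes are assumptions.

```python
# Mini-batch loading in PyTorch; batch size and tensor shapes are assumptions.
from torch.utils.data import TensorDataset, DataLoader

def make_loader(mel_tensors, labels, batch_size=32):
    # mel_tensors: (N, 1, n_mels, frames) dB mel-spectrograms,
    # labels: (N,) emotion class indices.
    dataset = TensorDataset(mel_tensors, labels)
    # Small batches keep GPU memory usage bounded during training.
    return DataLoader(dataset, batch_size=batch_size, shuffle=True)
```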
In the DCGAN, the generator receives a random noise vector from the latent space and transforms it into image-like data. Researchers expanded the latent vectors into two-dimensional (2D) tensors with an initial fully connected linear layer and then progressively increased the image resolution through four transposed convolution layers to obtain the final image.
ReLU activation and batch normalization were applied after every transposed convolution layer to ensure network stability, and the final layer of the generator used the tanh activation function to restrict the output to the [–1, 1] range. The discriminator in the DCGAN receives image data and classifies whether an image is genuine or created by the generator. It contains four convolutional layers, each incorporating a Leaky ReLU activation function and batch normalization; the last convolutional layer produces a single value indicating the image's likely authenticity, and a sigmoid activation function maps this value to a probability in the [0, 1] range.
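The following PyTorch sketch illustrates a generator and discriminator of this kind; the layer widths, kernel sizes, 64x64 single-channel output, and exact placement of batch normalization are assumptions rather than the study's configuration.

```python
# Minimal DCGAN sketch in PyTorch; all dimensions below are assumptions.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        # Fully connected layer expands the latent vector into a small 2D tensor.
        self.fc = nn.Linear(latent_dim, 256 * 4 * 4)
        self.net = nn.Sequential(
            # Upsampling blocks: transposed conv + batch norm + ReLU.
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(True),
            # Final transposed conv maps to one channel; tanh bounds output to [-1, 1].
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 4, 4)
        return self.net(x)          # (batch, 1, 64, 64) image-like mel-spectrogram

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # Strided conv blocks with Leaky ReLU (batch norm on the middle blocks).
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.LeakyReLU(0.2, True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, True),
            # Last conv collapses the 8x8 feature map to a single value per sample.
            nn.Conv2d(128, 1, 8),
            nn.Sigmoid(),           # probability that the input is genuine
        )

    def forward(self, x):
        return self.net(x).view(-1)
```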
Eventually, the mel-spectrograms produced by the DCGAN and the mel-spectrograms obtained from the original RAVDESS and EmoDB speech data were combined to create the final dataset, which was then used as input to the CNN-BiLSTM emotional speech recognition model.
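A minimal sketch of a CNN-BiLSTM classifier of this type is given below; the layer dimensions and the use of the last BiLSTM time step for classification are assumptions, since the paper's exact architecture is not reproduced here.

```python
# Hedged CNN-BiLSTM sketch over mel-spectrograms; all sizes are assumptions.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_mels=64, n_classes=6, hidden=128):
        super().__init__()
        # CNN front end extracts local time-frequency features.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),                       # halves mel and time axes
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        feat_dim = 64 * (n_mels // 4)              # channels x reduced mel bins
        # BiLSTM models the remaining temporal dynamics in both directions.
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                          # x: (batch, 1, n_mels, time)
        f = self.cnn(x)                            # (batch, 64, n_mels//4, time//4)
        f = f.permute(0, 3, 1, 2).flatten(2)       # (batch, time//4, feat_dim)
        out, _ = self.bilstm(f)
        return self.fc(out[:, -1])                 # logits for the six emotions
```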
Experimental evaluation of the approach
During the experimental evaluation, researchers used both the authentic mel-spectrograms obtained from the EmoDB and RAVDESS databases and combined sets in which the DCGAN-generated mel-spectrograms were added to the authentic ones (RAVDESS + augmented and EmoDB + augmented).
The CNN-BiLSTM emotion recognition model was used to evaluate and compare the performance achieved with the original data and with the original + augmented data. The RMSprop optimizer was selected to ensure rapid convergence and stable gradient updates, and unweighted accuracy (UA) and weighted accuracy (WA) were used as the performance metrics.
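As commonly defined in speech emotion recognition, WA is the overall accuracy and UA is the recall averaged over the emotion classes; the sketch below computes both with scikit-learn under that assumption.

```python
# WA and UA as commonly defined in speech emotion recognition (assumed here):
# WA = overall accuracy, UA = mean per-class recall.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

def weighted_accuracy(y_true, y_pred):
    return accuracy_score(y_true, y_pred)           # fraction of correct predictions

def unweighted_accuracy(y_true, y_pred):
    return balanced_accuracy_score(y_true, y_pred)  # recall averaged over classes

# Example: y_true = [0, 0, 1, 2], y_pred = [0, 1, 1, 2] gives WA = 0.75, UA ≈ 0.83.
```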
Researchers split the data into training, validation, and test sets in a 7:1.5:1.5 ratio to maintain model stability while allowing adequate training and evaluation. Additionally, the ReduceLROnPlateau method was employed to adjust the learning rate dynamically and keep the optimization process stable.
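The following sketch shows how such a split and learning-rate schedule might be set up in PyTorch; the learning rate, scheduler factor, and patience values are assumptions.

```python
# 7:1.5:1.5 split, RMSprop, and ReduceLROnPlateau in PyTorch; the learning
# rate, scheduler factor, and patience are assumptions.
import torch
from torch.utils.data import random_split

def setup_training(dataset, model):
    # dataset: combined (mel, label) pairs; model: e.g. the CNN-BiLSTM sketched above.
    n = len(dataset)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    train_set, val_set, test_set = random_split(
        dataset, [n_train, n_val, n - n_train - n_val]
    )
    optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-4)
    # Lower the learning rate when the validation loss stops improving;
    # call scheduler.step(val_loss) after each epoch.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5
    )
    return train_set, val_set, test_set, optimizer, scheduler
```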
Significance of the study
The experimental results demonstrated that adding the augmented data to the original data significantly improved both UA and WA compared with the values achieved using only the original data, for both the RAVDESS and EmoDB datasets. For instance, the WA of the RAVDESS dataset and the RAVDESS + augmented dataset was 64.8% and 72.3%, respectively, while the corresponding UA was 64.2% and 72.3%.
Similarly, the WA of the EmoDB dataset and the EmoDB + augmented dataset was 80.6% and 90.4%, respectively, while the corresponding UA was 82.6% and 91.3%. In summary, the findings demonstrate the feasibility of using DCGAN-based data augmentation to effectively improve the performance of speech emotion recognition models.