In an article published in the journal Sensors, researchers systematically reviewed facial and pose emotion recognition using deep learning and computer vision. Analyzing 77 studies, they categorized methods such as convolutional neural networks (CNNs), Faster region-based CNNs (Faster R-CNN), and Vision Transformers (ViTs), and evaluated their application in psychology, healthcare, and entertainment.
The review highlighted current trends, relevant datasets, and the scope of emotion detection techniques in human-computer interaction, providing insights into state-of-the-art algorithms and their components.
Background
Machine learning enables machines to learn from data without explicit programming, while deep learning, a subset of machine learning, utilizes artificial neural networks to tackle complex problems. Emotion recognition, which involves interpreting human emotions through verbal and non-verbal cues, has become a critical application of these technologies.
Traditional methods, such as the facial action coding system (FACS), face challenges in real-world scenarios and lack consistency in predicting emotional states. Recent deep learning methods, especially CNNs, have enhanced the accuracy and efficiency of emotion recognition.

This review aimed to bridge gaps in existing research by systematically analyzing studies on emotion recognition using deep learning, focusing on both facial and body expressions, and evaluating different deep learning architectures and datasets. This comprehensive analysis was intended to guide future research and applications in various fields, including healthcare, education, and entertainment.
Systematic Review Methodology
This study followed a structured process divided into three phases: planning the review, conducting the review, and reporting the results. In the planning phase, the necessity for this systematic review was established, and research questions were formulated to guide the review protocol. During the conducting phase, a comprehensive validation process was performed, followed by the development of a search strategy to identify relevant research studies.
The initial search yielded 522 studies from databases such as Scopus, Institute of Electrical and Electronics Engineers (IEEE) Xplore, Association for Computing Machinery (ACM) Digital Library, and Web of Science. After removing duplicates and non-indexed articles, 122 studies were selected for screening of titles and abstracts, which narrowed the pool to 85 articles for full-text quality assessment.
Eligibility criteria included empirical studies on facial or body pose emotion detection using deep learning and computer vision, while excluding studies focused on verbal emotion detection or unrelated deep learning contexts. No time constraints were applied, with all studies published before December 2023 considered. The search strategy involved complex Boolean expressions to cover a wide range of relevant studies.
A rigorous quality assessment was conducted on the 85 potential studies using a set of nine questions, with responses scored as yes (1), partially (0.5), or no (0). This assessment helped to filter out less relevant studies, resulting in 77 studies deemed relevant for detailed analysis. The assessment scores categorized studies into high, medium, and low relevance, with only eight studies excluded due to low scores.
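The scoring scheme described above is simple to mechanize. The sketch below illustrates it in Python; note that the review does not state the exact cut-off scores used to label studies high, medium, or low relevance, so the thresholds here are illustrative assumptions.

```python
# Sketch of the nine-question quality assessment: each answer is scored
# yes (1), partially (0.5), or no (0), and the total maps to a relevance
# category. The high/medium cut-offs are assumed, not taken from the review.

SCORES = {"yes": 1.0, "partially": 0.5, "no": 0.0}

def assess_study(answers):
    """Score a study from its answers to the nine quality questions."""
    if len(answers) != 9:
        raise ValueError("expected answers to all nine questions")
    total = sum(SCORES[a] for a in answers)
    if total >= 7:           # assumed threshold for "high"
        relevance = "high"
    elif total >= 4.5:       # assumed threshold for "medium"
        relevance = "medium"
    else:
        relevance = "low"    # low-scoring studies were excluded
    return total, relevance

total, relevance = assess_study(
    ["yes", "yes", "partially", "yes", "no", "yes", "partially", "yes", "yes"]
)
print(total, relevance)  # 7.0 high
```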
Results and Analysis
The researchers summarized the findings from a literature review on emotion detection, structured around a taxonomy that included study scope, test environments, methodologies, and datasets. The scope covered facial macro-expressions, micro-expressions, and body expressions. Facial macro-expressions are voluntary and easily observable, lasting from 0.75 to 2 seconds.
Researchers utilized CNN-based models like the face-sensitive CNN (FS-CNN) and the deep facial expression vector extractor (DeepFEVER), achieving high accuracies on datasets such as large-scale celebrity face attributes (CelebA) and AffectNet. Two-branch CNN models and techniques such as image processing and adaptive feature mapping were employed to handle challenges like face masks and low-light conditions.

Facial micro-expressions, by contrast, are involuntary and fleeting, revealing true emotions.
Methods like ViT for feature extraction and support vector machines (SVM) for classification achieved high accuracies despite the challenge of fleeting expressions. Data augmentation, pre-processing, and lightweight CNN architectures further enhanced detection accuracy, particularly in datasets like the Chinese Academy of Sciences micro-expression (CASME) and CASME II.
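Data augmentation matters for micro-expression work because datasets like CASME II are small. A minimal numpy sketch of two common label-preserving augmentations (horizontal flip and random crop) is shown below; the image size and crop size are arbitrary stand-ins, not values from the reviewed studies.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror the image left-to-right; a face's emotion label is unchanged."""
    return img[:, ::-1]

def random_crop(img, size, rng):
    """Cut a random size x size patch; pipelines typically resize it back."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

rng = np.random.default_rng(0)
face = rng.random((48, 48))          # stand-in for one grayscale face frame
augmented = [horizontal_flip(face), random_crop(face, 40, rng)]
print([a.shape for a in augmented])  # [(48, 48), (40, 40)]
```

Each augmented copy keeps the original emotion label, effectively multiplying the usable training set without new recordings.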
Gesture expressions involve dynamic body movements and static poses captured in single frames. Deep learning methods such as three-dimensional CNNs (3D-CNNs) and CNNs combined with long short-term memory (LSTM) networks were used for feature extraction and classification. Multimodal datasets incorporating facial expressions, body gestures, voice, and physiological signals enabled comprehensive emotion recognition, achieving accuracies of over 83%.
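One common way to combine such modalities, offered here as an illustrative sketch rather than the exact method of any reviewed study, is late fusion: each modality's classifier outputs class probabilities, which are then averaged into a single prediction. The emotion list and probability values below are hypothetical.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "surprise"]  # illustrative subset

def late_fusion(probabilities, weights=None):
    """Average per-modality class probabilities into one fused prediction."""
    p = np.asarray(probabilities, dtype=float)        # (modalities, classes)
    w = np.ones(len(p)) if weights is None else np.asarray(weights, dtype=float)
    fused = (w[:, None] * p).sum(axis=0) / w.sum()    # weighted mean per class
    return fused, EMOTIONS[int(fused.argmax())]

face  = [0.10, 0.70, 0.10, 0.10]   # hypothetical per-modality classifier outputs
body  = [0.20, 0.50, 0.20, 0.10]
voice = [0.10, 0.60, 0.20, 0.10]
fused, label = late_fusion([face, body, voice])
print(label)  # happiness
```

The optional weights allow more reliable modalities (often the face) to dominate the fused decision.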
The taxonomy distinguished between real-world and controlled test environments to understand the adaptability of deep learning methods. Methodologies included CNNs, Faster R-CNN, and ViT, with CNNs being widely utilized. Datasets like AffectNet, facial expression recognition (FER)-2013, extended Cohn-Kanade (CK+), and Japanese female facial expression (JAFFE) were commonly used, capturing a range of emotions in various settings.
Discussion and Insights
The field of emotion recognition using deep learning and computer vision has advanced significantly, outperforming traditional methods. Key challenges included limited labeled datasets and handling diverse expressions in real-world contexts. The majority of studies focused on facial macro-expressions (88.3%) and micro-expressions (11.7%).
CNNs dominated the methodologies, with notable use of transfer learning. ViTs showed promise due to their global context analysis. Dataset quality and diversity were critical, with FER-2013 and CK+ being popular. Performance improvement techniques included fine-tuning and batch normalization.
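Batch normalization, one of the performance techniques noted above, standardizes each feature across a mini-batch before a learnable scale and shift are applied. A minimal numpy sketch of the training-time forward pass (with scalar scale and shift for simplicity) follows.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift.

    x: (batch, features) activations; gamma and beta stand in for the
    learnable scale and shift parameters (plain floats here for brevity).
    """
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
acts = rng.normal(loc=5.0, scale=3.0, size=(64, 8))  # raw layer activations
out = batch_norm(acts)
# After normalization, each feature has ~zero mean and ~unit variance,
# which stabilizes and speeds up training of deep networks.
print(np.allclose(out.mean(axis=0), 0, atol=1e-6))  # True
```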
Conclusion
This systematic review highlighted the significant advancements in emotion recognition using deep learning and computer vision, emphasizing the effectiveness of CNNs, Faster R-CNN, and ViTs. Despite substantial progress, challenges such as limited annotated datasets and real-world applicability persisted.
Facial macro-expressions dominated research, while micro-expressions and body gestures showed growing interest due to new deep learning techniques. Future research should address dataset diversity, real-world conditions, and specialized populations.
Journal reference:
- Pereira, R., Mendes, C., Ribeiro, J., Ribeiro, R., Miragaia, R., Rodrigues, N., Costa, N., & Pereira, A. (2024). Systematic Review of Emotion Detection with Computer Vision and Deep Learning. Sensors, 24(11), 3484. https://doi.org/10.3390/s24113484