In an article published in the journal Future Internet, researchers explored the use of multimodal large language models (MLLMs) for emotion recognition from videos. They investigated MLLMs' performance in a zero-shot learning setting, combining acoustic and visual data modalities.
While MLLMs did not outperform state-of-the-art models on the Hume-Reaction benchmark, they excelled at recognizing emotions whose intensities deviate markedly from the average. The authors highlighted new opportunities for enhancing existing emotion recognition systems with MLLM content reformulation.
Background
Video content is increasingly prevalent on platforms like YouTube and TikTok, necessitating efficient content classification and annotation solutions. Recognizing human emotions in videos is crucial for applications in healthcare monitoring, AI chatbots, and gaming.
Traditional approaches to emotion recognition often rely on processing facial expressions and audio streams using transformer architectures. However, these methods require extensive training data and do not fully exploit cross-modality interactions.
Recent advancements in MLLMs, which combine visual and acoustic data, offer new opportunities.
This study explored MLLMs for emotion recognition, focusing in particular on emotional reaction intensity (ERI) estimation. By applying video-language models such as Video-LLaVA (learning united visual representation by alignment before projection) and integrating them with state-of-the-art architectures such as ViPER (video-based perceiver for emotion recognition), the research investigated MLLMs' potential to enhance emotion detection.
Recognizing Emotional Reactions in Videos
The aim of the researchers was to recognize emotional reactions in videos using the Hume-Reaction dataset, a comprehensive collection designed for the Emotional Reactions Sub-Challenge. The dataset included recordings from 2222 subjects, with over 70 hours of video data capturing naturalistic emotional reactions in uncontrolled environments.
Emotions such as adoration, amusement, anxiety, disgust, empathic pain, fear, and surprise were self-annotated by the subjects, who rated the intensity of each emotion on a scale from 0 to 1. The task was thus framed as a multi-output regression problem: predicting the seven intensity scores for each video.
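To make the label format concrete, the sketch below (illustrative only; the intensity values are hypothetical) shows how each video's self-annotated ratings form a seven-dimensional regression target:

```python
# Illustrative sketch of the Hume-Reaction label format (values are made up).
EMOTIONS = ["Adoration", "Amusement", "Anxiety", "Disgust",
            "Empathic Pain", "Fear", "Surprise"]

# Hypothetical self-annotated intensities in [0, 1], one per emotion.
example_label = {"Adoration": 0.0, "Amusement": 0.8, "Anxiety": 0.1,
                 "Disgust": 0.0, "Empathic Pain": 0.0, "Fear": 0.05,
                 "Surprise": 0.6}

# Each video maps to a 7-dimensional vector of intensities to be predicted.
target_vector = [example_label[e] for e in EMOTIONS]
print(target_vector)
```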
The dataset was divided into three splits: a training set with 15,806 samples, a development set with 4,657 samples, and a private test set with 4,604 samples. The training and development sets were used for experimentation, while the test set was not used because its labels are not publicly available.
The authors employed three main methodologies: direct querying of Video-LLaVA, probing networks, and integrating features of MLLM-generated descriptions into a transformer-based architecture. Direct querying involved prompting Video-LLaVA to assign emotion scores. The probing strategy fine-tuned a small regressor on the embeddings produced by Video-LLaVA.
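As a rough illustration of the probing idea, a compact regressor can be trained on pooled embeddings from the frozen Video-LLaVA backbone. This is a minimal sketch, not the authors' exact architecture; the embedding size, hidden width, and sigmoid output head are assumptions:

```python
import torch
import torch.nn as nn

class ProbingRegressor(nn.Module):
    """Small regressor trained on top of frozen MLLM embeddings (sketch)."""
    def __init__(self, embed_dim: int = 4096, num_emotions: int = 7,
                 hidden_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_emotions),
            nn.Sigmoid(),  # intensities are bounded in [0, 1]
        )

    def forward(self, video_embedding: torch.Tensor) -> torch.Tensor:
        # video_embedding: (batch, embed_dim) pooled representation
        # assumed to be extracted from the frozen Video-LLaVA backbone.
        return self.head(video_embedding)

# Usage with a placeholder embedding batch.
probe = ProbingRegressor()
dummy_embeddings = torch.randn(8, 4096)
scores = probe(dummy_embeddings)  # (8, 7) predicted intensities
```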
The third approach integrated textual features from Video-LLaVA-generated descriptions into the ViPER architecture. This integration involved either using a single textual description for the entire video or generating frame-specific descriptions to enhance the alignment of textual and visual information.
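The sketch below illustrates the general idea of fusing description embeddings with visual features before a regression head; the actual ViPER-VATF fusion mechanism is more elaborate, and the feature dimensions and layer sizes shown here are assumptions:

```python
import torch
import torch.nn as nn

class TextAugmentedFusion(nn.Module):
    """Sketch of combining visual features with embeddings of
    MLLM-generated textual descriptions before regression."""
    def __init__(self, vis_dim: int = 768, txt_dim: int = 768,
                 num_emotions: int = 7):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(vis_dim + txt_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_emotions),
            nn.Sigmoid(),
        )

    def forward(self, visual_feats: torch.Tensor,
                text_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, vis_dim) pooled frame features
        # text_feats:   (batch, txt_dim) embedding of the generated
        #               description (one per video, or averaged per-frame)
        fused = torch.cat([visual_feats, text_feats], dim=-1)
        return self.regressor(fused)
```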
Empirical Evaluation and Results
The researchers presented the empirical evaluation of emotion recognition approaches using quantitative and qualitative analyses. The experiments were conducted on a high-performance machine, with 32 equidistant frames selected from each video. The AdamW optimizer and mean squared error (MSE) loss function were used for training.
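The snippet below sketches this setup; only the 32-frame equidistant sampling, the AdamW optimizer, and the MSE loss come from the article, while the learning rate, feature dimensions, and placeholder model are assumptions:

```python
import numpy as np
import torch

def equidistant_frame_indices(num_frames_in_video: int, num_samples: int = 32):
    """Pick evenly spaced frame indices across the whole video."""
    return np.linspace(0, num_frames_in_video - 1, num_samples).round().astype(int)

# Example: a 900-frame clip reduced to 32 representative frames.
indices = equidistant_frame_indices(900)

# Training configuration mirroring the reported setup: AdamW + MSE loss.
model = torch.nn.Linear(768, 7)                      # placeholder regression head
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed learning rate
criterion = torch.nn.MSELoss()

features = torch.randn(8, 768)   # placeholder batch of video features
targets = torch.rand(8, 7)       # intensities in [0, 1]

optimizer.zero_grad()
loss = criterion(model(features), targets)
loss.backward()
optimizer.step()
```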
Quantitative results indicated that direct querying of Video-LLaVA yielded a low mean Pearson correlation of 0.0937, suggesting limited effectiveness. Probing Video-LLaVA showed improvements, with correlations of 0.2333 and 0.2351 for two different prompts, demonstrating the potential of probing strategies even though they did not surpass the baselines.
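For reference, the mean Pearson correlation used as the evaluation metric averages the per-emotion correlations between predicted and self-reported intensities; a minimal sketch with random toy data follows:

```python
import numpy as np
from scipy.stats import pearsonr

def mean_pearson(preds: np.ndarray, targets: np.ndarray) -> float:
    """Average the per-emotion Pearson correlation over the seven emotions."""
    correlations = [pearsonr(preds[:, i], targets[:, i])[0]
                    for i in range(targets.shape[1])]
    return float(np.mean(correlations))

# Toy example with random predictions and labels for 100 samples.
rng = np.random.default_rng(0)
preds = rng.random((100, 7))
targets = rng.random((100, 7))
print(mean_pearson(preds, targets))
```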
Integration of Video-LLaVA textual features into the ViPER-VATF framework yielded competitive results, with correlations of 0.3004 and 0.3011, outperforming the original ViPER-VATF for specific emotions such as anxiety and empathic pain. Integrating frame-specific textual features generated with LLaVA achieved a correlation of 0.2895, slightly below the Video-LLaVA integration.
The qualitative analysis compared the ViPER-VATF models based on contrastive language-image pre-training (CLIP) and Video-LLaVA. Confusion matrices revealed that the CLIP-based approach concentrated predictions near the average value, while the Video-LLaVA-based approach covered a wider range of values, indicating better detection of emotional outliers. This suggested that combining both approaches could improve the emotion recognition system by balancing robustness in common expressions and sensitivity to extreme emotional intensities.
Conclusion
In conclusion, the researchers demonstrated the potential of MLLMs for enhancing video emotion recognition. While traditional models like ViPER and CLIP required extensive fine-tuning, integrating Video-LLaVA features into ViPER-VATF showed improved performance, especially in detecting extreme emotional responses.
MLLMs offered flexibility and scalability, accommodating new emotional contexts without predefined templates. However, challenges included limited adaptability, potential bias, and interpretability issues. Future work will explore expanding these methods to other scenarios, integrating audio-language LLMs, and refining multimodal approaches to address these limitations and further enhance emotion detection accuracy.
Journal reference:
- Vaiani, L., Cagliero, L., & Garza, P. (2024). Emotion Recognition from Videos Using Multimodal Large Language Models. Future Internet, 16(7), 247. DOI: 10.3390/fi16070247, https://www.mdpi.com/1999-5903/16/7/247