Depression Detection in Facial Videos with Deep Learning

In an article published in the journal Electronics Letters, the authors explored automatic depression recognition from facial videos using deep learning models. Aiming for an objective, fast, and reliable assessment of patients' mental health, they studied how different face alignment methods, data augmentation, improved optimization, and learning rate scheduling techniques affect recognition performance.

Study: Enhancing Depression Detection in Facial Videos with Deep Learning. Image credit: Generated using DALL·E 3

Background

Persistent clinical depression can result in significant complications, affecting both mental and physical well-being. Numerous studies have linked depression to the onset of various health conditions, including cardiovascular disease, osteoporosis, accelerated aging, cognitive impairments, Alzheimer's disease, and other forms of dementia, as well as an elevated risk of premature mortality. Researchers in the field of computer vision have previously developed methods for Automatic Depression Detection (ADD) that rely on facial expression analysis (FEA) in both images and videos.

Some approaches use deep learning models to extract features from faces and classify them as showing signs of depression or not. More recently, there has been a shift towards leveraging spatio-temporal (ST) information from videos, with the use of 3D-Convolutional Neural Network (3D-CNN) architectures and temporal pooling techniques to capture dynamic information and train 2D-CNNs for depression detection. While novel architectures have improved accuracy in depression recognition, most of the previous studies did not explore or discuss important aspects like face alignment, preprocessing, scheduling, or optimization. The authors of the present study focus on these often overlooked aspects of the machine learning process.

Proposed Methodology

Deep learning models have been widely applied in video-based depression detection. However, the diversity of preprocessing, data augmentation, and optimization techniques used across studies makes it difficult to compare model architectures fairly.

The present study introduced a 2D-CNN model that is constructed using the ResNet-50 architecture, leveraging static textural features from video frames. The authors implemented novel training optimization along with scheduling schemes to improve performance and a combined scoring method to estimate the severity of depression based on complementary textural-based models.

Face Alignment: The authors emphasized the importance of correctly aligning and preprocessing the facial regions from each frame of a video. The face detector used is a Multi-task Cascaded Convolutional Network (MTCNN).

The order of the preprocessing steps directly affects the resulting image. Conventional pipelines crop the image, realign it, and then rescale it; the authors instead aligned the image first and then rescaled it. They introduced two face alignment techniques, pose-dependent and pose-independent alignment, each offering distinct benefits. The goal is to preserve textural information within the facial boundaries while ensuring that both techniques contribute positively to depression recognition.
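To make the "align first, then rescale" ordering concrete, here is a minimal sketch that works on landmark coordinates rather than pixels. It is not the authors' pose-dependent or pose-independent method: the eye-line rotation, the function names, and the `target_eye_distance` parameter are all assumptions chosen for illustration.

```python
import math


def alignment_angle(left_eye, right_eye):
    """Tilt of the eye line from horizontal, in radians (illustrative helper)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)


def rotate_point(point, center, angle):
    """Rotate a point about center by -angle, which levels the eye line."""
    x, y = point[0] - center[0], point[1] - center[1]
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    return (center[0] + x * cos_a - y * sin_a,
            center[1] + x * sin_a + y * cos_a)


def align_then_rescale(landmarks, left_eye, right_eye, target_eye_distance):
    """Step 1: rotate so the eyes are level. Step 2: scale to a fixed eye distance.

    Returns coordinates relative to the midpoint between the eyes.
    """
    angle = alignment_angle(left_eye, right_eye)
    center = ((left_eye[0] + right_eye[0]) / 2, (left_eye[1] + right_eye[1]) / 2)
    aligned = [rotate_point(p, center, angle) for p in landmarks]

    eye_distance = math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])
    scale = target_eye_distance / eye_distance
    return [((x - center[0]) * scale, (y - center[1]) * scale) for x, y in aligned]
```

Because rotation happens before scaling, the scale factor is computed from the true eye distance and applies equally in both directions, so the facial texture is not stretched along one axis.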

Model Architecture: The paper employs a ResNet-50 architecture as the backbone, enhanced with fully connected layers that include 512 neurons and a regression layer containing 128 neurons to estimate depression levels. The architecture is initialized with pre-trained weights so that it can reuse previously learned features. All layers remain unfrozen to facilitate learning of low-level textural features. The final prediction for a video is the mean of the predictions for its individual frames.
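The frame-to-video aggregation described above can be sketched in a few lines. The function name and the `(video_id, score)` input layout are assumptions for illustration; the article only states that the per-frame predictions are averaged.

```python
from collections import defaultdict
from statistics import fmean


def aggregate_by_video(frame_scores):
    """Average per-frame depression scores into one score per video.

    frame_scores: iterable of (video_id, predicted_score) pairs (assumed layout).
    """
    grouped = defaultdict(list)
    for video_id, score in frame_scores:
        grouped[video_id].append(score)
    # The video-level prediction is the mean of its frame-level predictions.
    return {video_id: fmean(scores) for video_id, scores in grouped.items()}
```

For example, a video whose frames were scored 10.0 and 14.0 would receive a video-level score of 12.0.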

Data Augmentation: The authors applied data augmentation techniques to enhance the complementary information provided by pose-dependent alignment and pose-independent alignment. They utilized random horizontal flips along with brightness, contrast, and saturation changes. Unlike previous papers, they applied no vertical flips and added no extra images, in order to maintain data integrity.
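A minimal sketch of this kind of augmentation is shown below. It assumes a grayscale image stored as nested lists of floats in [0, 1], so saturation changes are left out; the `flip_prob` and `jitter` values are hypothetical and are not taken from the paper.

```python
import random


def _clamp(value):
    """Keep a pixel value inside the valid [0, 1] range."""
    return min(1.0, max(0.0, value))


def horizontal_flip(image):
    """Mirror each row left to right (no vertical flip is ever applied)."""
    return [row[::-1] for row in image]


def adjust_brightness(image, factor):
    """Scale every pixel by factor."""
    return [[_clamp(p * factor) for p in row] for row in image]


def adjust_contrast(image, factor):
    """Move every pixel toward (factor < 1) or away from (factor > 1) the image mean."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[_clamp(mean + (p - mean) * factor) for p in row] for row in image]


def augment(image, rng, flip_prob=0.5, jitter=0.2):
    """Apply a random horizontal flip plus brightness and contrast jitter."""
    if rng.random() < flip_prob:
        image = horizontal_flip(image)
    image = adjust_brightness(image, rng.uniform(1 - jitter, 1 + jitter))
    image = adjust_contrast(image, rng.uniform(1 - jitter, 1 + jitter))
    return image
```

Each call transforms the image it is given and returns one image, so the dataset size stays the same, which matches the statement that no extra images were added.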

Training: The models are trained on the Audio/Visual Emotion Challenge AVEC2013 and AVEC2014 databases. The RAdam optimizer, enhanced with Lookahead optimization, was employed together with a scheduling scheme that dynamically reduces the learning rate when improvements cease.
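The idea of lowering the learning rate once improvements cease can be illustrated with a small plateau scheduler. This is a generic sketch, not the authors' exact schedule: the class name and the `factor`, `patience`, and `min_lr` values are assumptions.

```python
class ReduceOnPlateau:
    """Cut the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor      # multiplier applied on each reduction (assumed value)
        self.patience = patience  # non-improving epochs tolerated before reducing
        self.min_lr = min_lr      # lower bound on the learning rate
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr
```

With `patience=1`, two consecutive epochs without a new best validation loss trigger a single reduction; the rate then stays at the lower value until the next plateau.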

Experimentation

Datasets: The proposed approach’s performance was evaluated on two benchmark datasets, AVEC2013 and AVEC2014. The data are split into three partitions (Training, Development, and Testing), with 50 videos per partition.

Experimental Setup: The evaluation is based only on static texture features extracted from the videos of the two benchmark datasets. The results of the different models are compared against state-of-the-art models that rely on visual data, including static and ST models.

Protocol and Performance Metrics: Standard performance metrics, including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), were used to assess the models' accuracy. The final score for a video is the average of the scores over all of its frames.
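For reference, the two metrics can be computed as follows. The function names are illustrative, and the inputs are assumed to be one ground-truth score and one predicted score per video.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """MAE: the average absolute difference between truth and prediction."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    """RMSE: the square root of the average squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a sense of whether a model's error comes from a few badly predicted videos or is spread evenly.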

Error Distribution: The analysis of error distribution, emphasizing the overall reliability of the proposed approach, shows a relatively small overall error, with a majority of videos displaying minimal error.

Comparison with State-of-the-Art: A comparison of the results with state-of-the-art approaches highlighted the competitiveness of the textural-based models. It demonstrated that the approach, which leverages only static information and a well-known deep learning architecture, can achieve results on par with more complex ST models.

Conclusion

To summarize, the authors comprehensively explored a straightforward yet effective approach to depression recognition from facial videos. They showcased the significance of preprocessing and scheduling choices and demonstrated that these aspects may have a substantial impact on model performance, potentially overshadowing the contributions of different network architectures. The findings underline the need for further systematic investigations to distinguish the impact of novel architectures in depression recognition.

Journal reference:

Article Revisions

  • Oct 26 2023 - "Enhancing Depression Detection in Facial Videos with Deep Learning" to "Depression Detection in Facial Videos with Deep Learning"

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Nandi, Soham. (2023, October 25). Depression Detection in Facial Videos with Deep Learning. AZoAi. Retrieved on January 02, 2025 from https://www.azoai.com/news/20231025/Depression-Detection-in-Facial-Videos-with-Deep-Learning.aspx.

  • MLA

    Nandi, Soham. "Depression Detection in Facial Videos with Deep Learning". AZoAi. 02 January 2025. <https://www.azoai.com/news/20231025/Depression-Detection-in-Facial-Videos-with-Deep-Learning.aspx>.

  • Chicago

    Nandi, Soham. "Depression Detection in Facial Videos with Deep Learning". AZoAi. https://www.azoai.com/news/20231025/Depression-Detection-in-Facial-Videos-with-Deep-Learning.aspx. (accessed January 02, 2025).

  • Harvard

    Nandi, Soham. 2023. Depression Detection in Facial Videos with Deep Learning. AZoAi, viewed 02 January 2025, https://www.azoai.com/news/20231025/Depression-Detection-in-Facial-Videos-with-Deep-Learning.aspx.
