In a paper published in the journal Sensors, researchers proposed the fine-tuned channel-spatial attention transformer (FT-CSAT) model to address key facial expression recognition (FER) challenges, including facial occlusion and head pose changes. FT-CSAT achieves state-of-the-art recognition accuracy on benchmark datasets, confirming its robustness under both conditions.
Background
Facial expressions are crucial in human communication, directly conveying inner emotions. The importance of FER extends to fields such as autonomous driving, human-computer interaction, and healthcare, driving growing research interest. In human-computer interaction, facial expression information enhances machine responses and the interaction experience. In driver-fatigue monitoring, FER tracks the driver's mental state to support safe operation and issue timely alerts. In medical diagnosis, recognizing facial expressions supports the analysis of patients' emotional states. However, automatic FER remains challenging because of individual differences and real-world issues such as pose changes and occlusion.
Previous work
Since 2012, classic convolutional neural network (CNN) architectures such as AlexNet, Visual Geometry Group network (VGGNet), GoogLeNet, and residual network (ResNet) have achieved remarkable results in image recognition tasks within the domain of FER. Various approaches have been proposed to enhance accuracy, such as improved ResNet variants, VGG-based expression recognition, and residual expression feature learning using generative adversarial networks (GANs).
Despite these advances, CNNs struggle to capture long-range relationships between different facial regions because they rely on local operations within neighborhoods. The transformer model, introduced by Google researchers for tasks such as machine translation, addresses this limitation through global self-attention. In the context of FER, the Vision Transformer (ViT) employs self-attention to learn robust facial features from a global perspective, thereby improving feature extraction.
Researchers have further explored the application of transformers to expression recognition, introducing attention-selective fusion modules, squeeze-extraction attention modules, and locally optimized Swin Transformers, each achieving notable accuracy improvements on various datasets. The CSWin Transformer, which employs cross-shaped window self-attention (CSWSA) and locally enhanced positional encoding, has demonstrated state-of-the-art performance in various vision tasks while maintaining computational efficiency.
Proposed framework
In the present study, researchers introduced FT-CSAT, a transformer backbone comprising two main modules: a fine-tuning module and a channel-spatial attention module. The framework builds on the CSWin Transformer, which excels at extracting local facial expression information but struggles to learn global features because its self-attention mechanism divides input features into smaller blocks.
To overcome this limitation, a channel-spatial attention module is introduced to enhance the model's ability to capture crucial global information. To improve training efficiency, the CSWin Transformer is initialized with parameters pre-trained on ImageNet-1K, enabling the model to leverage knowledge from a large-scale dataset and achieve faster convergence and better performance on downstream FER datasets such as FERPlus and RAF-DB. The authors then adopt parameter fine-tuning to improve performance while limiting the number of additional parameters.
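The CSWin implementation and its ImageNet-1K checkpoint are distributed with the original CSWin release and are not reproduced here. As a rough illustration of the general pattern, namely initializing a backbone with ImageNet-1K weights and swapping in an expression-recognition head before fine-tuning, the minimal PyTorch sketch below uses a torchvision ResNet-18 purely as a stand-in backbone; the class count and learning rate are assumptions, not values from the paper.

```python
# Minimal sketch of ImageNet-1K initialization followed by head replacement for FER.
# A torchvision ResNet-18 stands in for the CSWin Transformer backbone, which ships
# with its own official code and checkpoint; hyperparameters here are illustrative.
import torch.nn as nn
from torch.optim import AdamW
from torchvision.models import resnet18, ResNet18_Weights

NUM_CLASSES = 7  # e.g., the seven basic expression categories in RAF-DB

# Load ImageNet-1K pre-trained weights so training starts from learned features.
model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)

# Replace the 1000-way ImageNet classifier with an expression-recognition head.
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Fine-tune on the FER dataset at a small learning rate.
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
```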
The channel-spatial attention module incorporates the convolutional block attention module (CBAM), which refines features along two independent dimensions, channel and spatial, by applying channel attention followed by spatial attention. This focuses the network on informative channels and relevant spatial locations, improving feature extraction and the model's ability to identify key regions related to facial expressions.
Four approaches to integrating CBAM into different stages of the CSWin Transformer are examined, and pooling in the channel and spatial domains is found to effectively learn discriminative global and local features from facial expression images, strengthening the contribution of important spatial features in FER tasks.
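CBAM itself follows a well-documented recipe of pooled channel descriptors and pooled spatial maps. The sketch below is a minimal PyTorch rendition of a channel-spatial attention block of this kind, not the paper's exact module; the reduction ratio and kernel size are assumed defaults.

```python
# Minimal sketch of a CBAM-style channel-spatial attention block for feature maps
# of shape (batch, channels, height, width). Names and defaults are illustrative.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Re-weights channels using global average- and max-pooled descriptors."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=(2, 3)))   # (B, C) from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # (B, C) from max pooling
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale


class SpatialAttention(nn.Module):
    """Re-weights spatial locations using channel-wise average and max maps."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)    # (B, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)     # (B, 1, H, W)
        scale = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * scale


class CBAM(nn.Module):
    """Applies channel attention followed by spatial attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.spatial = SpatialAttention()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.spatial(self.channel(x))
```

Placed after a backbone stage, a block like this re-weights which channels and spatial positions feed the next stage, which is the effect the pooling-based integration described above is aiming for.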
The fine-tuning module employs the scaling and shifting features (SSF) parameter fine-tuning method, which improves performance while keeping the number of introduced parameters small. SSF fine-tunes the model by applying learnable scale and shift operations to the deep features extracted by the pre-trained transformer. The SSF modules are inserted after specific operations in the pre-trained model, such as the multi-layer perceptron, CSWSA, and layer normalization, to modulate features; because these operations are linear, they add no extra parameters at inference time.
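In the published SSF formulation, each fine-tuning step reduces to a learnable per-channel affine transform on frozen features. The sketch below is a minimal, assumed rendition for token features of shape (batch, tokens, dim), not the authors' code.

```python
# Minimal sketch of an SSF-style scale-and-shift module applied to frozen
# transformer features of shape (batch, tokens, dim); names are illustrative.
import torch
import torch.nn as nn


class ScaleShift(nn.Module):
    """Learnable per-channel scale (gamma) and shift (beta) on frozen features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))   # start as identity scaling
        self.beta = nn.Parameter(torch.zeros(dim))   # start with no shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = gamma * x + beta, broadcast over the batch and token dimensions.
        return x * self.gamma + self.beta
```

Because the transform is linear, the learned gamma and beta can be folded into the weights and biases of the preceding layer once training ends, which is why the method introduces no additional parameters or latency at inference.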
Study results
The proposed method is evaluated on the RAF-DB and FERPlus datasets. Random Gaussian noise is added to the training images for data augmentation, while the test sets remain unchanged. The CSWin Transformer-T, initialized with ImageNet-1K pre-trained parameters, serves as the baseline. FT-CSAT reaches 88.61% accuracy on RAF-DB and 89.26% on FERPlus, surpassing existing methods. Experiments on the Occlusion-RAF-DB and Pose-RAF-DB test sets confirm superior performance under occlusion and pose variations. Grad-CAM attention maps show that the model perceives globally informative facial regions effectively. An ablation study validates each module: the channel-spatial attention module raises accuracy by 0.72% and 0.84%, and fine-tuning adds a further 0.6% and 0.75% on RAF-DB and FERPlus, respectively.
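The paper's pipeline is described only as adding random Gaussian noise to training images; a minimal sketch of that kind of augmentation, with an assumed noise level, might look as follows.

```python
# Minimal sketch of Gaussian-noise augmentation for normalized image tensors;
# the standard deviation is an assumed value, not one reported by the authors.
import torch


def add_gaussian_noise(image: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    """Returns the image with zero-mean Gaussian noise added (training only)."""
    return image + torch.randn_like(image) * std
```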
Conclusion
In summary, this study introduces FT-CSAT, a model based on the CSWin Transformer for FER. The approach integrates a channel-spatial attention module to improve global feature extraction and uses fine-tuning for performance optimization and parameter control. Experimental results demonstrate FT-CSAT's superiority over state-of-the-art methods on the RAF-DB and FERPlus datasets, and the model proves robust to facial occlusion and head pose changes. Future work may explore multi-resolution strategies to handle facial scale changes and the construction of a high-quality, large-scale facial expression database to further advance deep-learning models for FER.