In an article recently submitted to the arXiv* server, researchers introduced Light and Accurate Face Detection (LAFD), a precise and lightweight face detection algorithm. LAFD was built on Retinaface and used a modified MobileNetV3 backbone.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
The paper's main contributions are adjusting the convolutional kernel size and channel expansion multiplier of the backbone and integrating the Squeeze-and-Excitation (SE) attention mechanism. A Deformable Convolution Network (DCN) and the focal loss function were also incorporated. Results from tests on the WIDERFACE dataset underscored LAFD's substantial accuracy improvements over Retinaface and LFFD. With enhancements reaching up to 8.3%, LAFD maintained its lightweight architecture, retaining a size of only 10.2MB.
Background
Face recognition plays a pivotal role in daily life, and it has evolved through early algorithms, the Adaptive Boosting (AdaBoost) framework, and the deep learning era. Early template-matching approaches used template images to identify faces, while the AdaBoost framework and the Viola-Jones algorithm significantly improved accuracy and speed.
The advent of deep learning introduced techniques like Convolutional Neural Networks (CNN), leading to breakthroughs like Faceness-Net, a deep convolutional network-based algorithm achieving substantial detection improvements. In 2022, YOLOv7 models, including YOLOv7-tiny and YOLOv7-lite-s, delivered lightweight and accurate face detection. The Retinaface model aimed for fast detection but struggled with accuracy for complex faces.
Related work
In past studies, Retinaface was a lightweight single-stage face detection network that demonstrated notable performance by employing MobileNetV1 as its backbone network on the validation subsets of the WIDERFACE dataset. The central process of the algorithm involved putting the training dataset into the MobileNetV1 backbone, generating feature maps, performing feature fusion, and then extracting feature pyramid structures through the utilization of the context module. The feature pyramid layers encapsulated varying scales of face information.
In Retinaface, the Single Stage Headless face detector (SSH) was employed as the context module, enlarging the model's receptive field to boost the detection of small faces. Furthermore, Retinaface's architecture pre-defined multiple prior boxes, allowing faces to be detected anywhere in the image. Each pixel in the feature maps corresponded to two sizes of prior boxes, establishing a dense grid of prior boxes designed to accommodate face detection at various positions.
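A minimal sketch of that prior-box grid in plain Python (the stride and the two box sizes below are illustrative placeholders, not the paper's values): each feature-map cell maps back to a pixel position in the input image and anchors two square boxes there.

```python
def prior_boxes(fmap_h, fmap_w, stride, sizes=(16, 32)):
    """Generate two square prior boxes per feature-map cell.

    fmap_h, fmap_w: feature-map dimensions; stride: downsampling factor
    back to the input image. The box sizes are illustrative, not the
    model's actual anchor configuration. Boxes are (x1, y1, x2, y2)
    in input-image coordinates.
    """
    boxes = []
    for i in range(fmap_h):
        for j in range(fmap_w):
            # Center of this cell projected back onto the input image.
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in sizes:
                boxes.append((cx - s / 2, cy - s / 2, cx + s / 2, cy + s / 2))
    return boxes
```

With two sizes per cell, a feature map of H × W cells yields 2·H·W prior boxes, which is why several pyramid levels together produce a very large candidate set.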
With four types of head predictions – face classification, face box regression, face key point regression, and 3D dense point regression – Retinaface refined its predictions and used the Smooth-L1 loss function for precise estimation. However, the abundance of overlapping face boxes stemming from the numerous prior boxes posed a challenge. To address this, Retinaface applied non-maximum suppression (NMS), ensuring that only the most relevant face boxes were kept. This multi-faceted approach formed the foundation of Retinaface's robust and efficient face detection mechanism.
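As a concrete illustration, greedy NMS repeatedly keeps the highest-scoring remaining box and discards any box that overlaps it beyond an Intersection over Union (IoU) threshold. This is a minimal sketch, not Retinaface's actual (vectorized) implementation:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.4):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps it too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```

Because NMS compares candidate boxes pairwise, its cost grows with the number of surviving detections, which is why the paper worries about excessive prior boxes slowing down post-processing.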
Proposed method
The LAFD algorithm in the present study introduces significant improvements across three key areas: the backbone network, the context module, and the loss function. In the enhanced MobileNetV3 backbone, the channel expansion multiplier is increased to extract more image information. Furthermore, a 7x7 convolutional kernel widens the receptive field, while the SE attention mechanism enhances feature extraction across the network's stages.
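The SE mechanism can be sketched in plain Python (the weights and dimensions below are illustrative, not the paper's): each channel is "squeezed" to its global average, that vector passes through a small bottleneck, and the result is a per-channel gate in (0, 1) used to rescale the feature map.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def se_gates(channel_means, w1, w2):
    """Squeeze-and-Excitation gating (illustrative sketch, biases omitted).

    channel_means: the global-average-pooled value of each channel
    (the "squeeze" step). w1, w2: weights of the bottleneck
    FC -> ReLU -> FC -> sigmoid "excitation" step. Returns one gate
    per channel; the feature map's channels are multiplied by these
    gates so the network can emphasize informative channels.
    """
    hidden = [max(0.0, sum(m * w for m, w in zip(channel_means, row)))
              for row in w1]                                  # FC + ReLU
    return [sigmoid(sum(h * w for h, w in zip(hidden, row)))
            for row in w2]                                    # FC + sigmoid
```

In a real network these weights are learned, and the bottleneck dimension is a fraction of the channel count to keep the block cheap.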
The method also introduces the DCN to effectively recognize irregular targets. Employing a combined approach of Cross-Entropy Loss and Focal Loss Function further bolsters model accuracy, particularly in recognizing small faces. However, challenges arise from excessive prior boxes during post-processing, potentially causing delays in NMS.
To mitigate this, the algorithm employs Cross-Entropy Loss in early training epochs and transitions to Focal Loss Function in later epochs, maintaining NMS efficiency while improving accuracy. This nuanced interplay between mitigating false recognition and enhancing small-face recognition highlights the method's intricacies, ultimately leading to heightened recall and average accuracy at the cost of reduced precision.
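The two losses and the epoch-based switch can be sketched as follows for a single binary prediction; the switch epoch here is an assumption for illustration, not a value reported in the paper:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1), label y in {0, 1}."""
    pt = p if y == 1 else 1.0 - p
    return -math.log(pt)

def focal(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: the (1 - pt)^gamma factor down-weights easy,
    well-classified examples so training focuses on hard ones
    (e.g. small faces). gamma/alpha are the commonly used defaults,
    not necessarily the paper's settings."""
    pt = p if y == 1 else 1.0 - p
    a = alpha if y == 1 else 1.0 - alpha
    return -a * (1.0 - pt) ** gamma * math.log(pt)

def loss_for_epoch(epoch, switch_epoch=100):
    """Schedule described in the text: cross-entropy in early epochs,
    focal loss in later ones. switch_epoch is a hypothetical value."""
    return bce if epoch < switch_epoch else focal
```

For a confidently correct prediction the focal loss is orders of magnitude smaller than cross-entropy, which is exactly the re-weighting that helps small-face recall.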
Experimental results
The WIDERFACE dataset, encompassing diverse and challenging facial variations, was employed, and the training process used the PyTorch framework. Images were proportionally enlarged to a maximum length of 1560 or width of 1200 pixels before model testing. A detection confidence threshold of 0.5 and an NMS Intersection over Union (IoU) threshold of 0.4 were applied to the model outputs.
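One plausible reading of that resizing step (the summary does not spell out the exact rule, so this is an assumption): pick the largest uniform scale factor that keeps the image within both limits.

```python
def scaled_size(height, width, max_length=1560, max_width=1200):
    """Proportional resize sketch: a single scale factor chosen so the
    resulting length (height) does not exceed max_length and the width
    does not exceed max_width. The precise rule used in the paper is
    an assumption here."""
    scale = min(max_length / height, max_width / width)
    return round(height * scale), round(width * scale)
```

Scaling both dimensions by one factor preserves the aspect ratio, so faces are not distorted before detection.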
The LAFD model showcased substantial improvements with an average accuracy of 94.3%, 92.6%, and 86.2% on the WIDERFACE validation subsets, outperforming Retinaface by 3.6%, 4.4%, and 12.4%, respectively. Traditional methods, like Viola-Jones (V-J) and Deformable Part-based Model (DPM), exhibited lower accuracy due to limited feature extraction capabilities. Faceness-Net and ScaleFace had drawbacks in multi-size feature extraction and attention mechanism, respectively.
Single-stage detectors SSH and Single Shot MultiBox Detector (SSD) showed better results. Larger models like FANet and TinaFace were less suitable for embedded scenarios, unlike LAFD. The lightweight YOLOv7-tiny performed comparably but with a smaller model size.
Furthermore, ablation experiments revealed the impact of each incorporated module on the deep convolutional network. The new backbone network, the Focal Loss function, deformable convolution, and resizing of the test images all contributed positively over Retinaface. Combining DCN and the Focal Loss produced partly conflicting effects, with DCN favored for its larger accuracy improvement. The model employing the modified MobileNetV3 backbone with DCN achieved a remarkable improvement of 3.3%, 3.9%, and 8.5% across the subsets compared to Retinaface. Scaling input images to specific dimensions further boosted accuracy by 3.3%, 4.1%, and 12.4%, respectively, relative to Retinaface.
Conclusion
This work enhances the Retinaface single-stage lightweight face detection network by refining its MobileNetV3 backbone. Modifications include the SE attention mechanism, the Inverted Residuals Block's channel expansion multiplier, and convolution kernel size adjustments for better face detection performance. The Deformable Convolution Network replaces the original SSH layer convolution, and the Cross-Entropy loss function is substituted with the Focal Loss function. Input images are preprocessed by proportionally resizing them to a maximum length of 1560px or width of 1200px. Future work will explore the use of Generalized Intersection over Union (GIOU), Distance Intersection over Union (DIOU), and other 2D loss functions, investigate the interplay between the Focal Loss function and DCN, and further optimize the MobileNetV3 backbone network parameters.