In a paper published in the journal Sensors, researchers introduced a lightweight enhancement to the YOLOv5 algorithm, leveraging integrated perceptual attention (IPA) and multiscale spatial channel reconstruction (MSCCR) modules. The proposed method reduced the model's parameter count and boosted its mean average precision (mAP@50) while keeping the number of floating-point operations (FLOPs) unchanged. This improvement optimizes vehicle detection for intelligent traffic management systems, enhancing their efficiency and functionality.
In addition to reducing model parameters and improving accuracy, integrating IPA and MSCCR modules provided richer contextual information for enhanced vehicle detection in diverse traffic environments. The optimized algorithm promises to advance intelligent traffic management and control systems significantly.
Related Work
Previous research in vehicle detection algorithms, primarily centered around you only look once version 5 (YOLOv5), has focused on tackling challenges in intricate traffic environments. While the original YOLO and YOLO-tiny models offer different trade-offs between accuracy and computational complexity, recent enhancements have either improved accuracy at the cost of increased complexity or reduced parameters at the cost of lower accuracy. Integrating transformer encoders improved performance but added computational cost, while lightweight networks such as MobileNet sacrificed accuracy for simplicity. These approaches therefore still struggle with increased complexity or with capturing detailed features in complex scenes.
YOLOv5 Enhancements and MSCCR Integration
In the improvements to YOLOv5s, integrated perceptual attention (IPA) and a C3_MR structure were introduced to redesign the backbone network. Inspired by the mobile vision transformer (MobileViT), a combination of convolution and self-attention principles was employed, with C3_MR used to aggregate shallow features and integrated perceptual attention to aggregate deep features. This reduced model parameters and facilitated hierarchical feature learning, enhancing the model's expressiveness.
The integrated perceptual attention (IPA) module aimed to mitigate the high computational cost of transformer encoders. IPA adopted a parallel two-branch structure, using efficient attention to capture global information and convolutional attention to capture local information. By incorporating the idea of grouping, IPA reduced parameters and computational complexity while effectively aggregating information from the global and local branches.
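The paper's exact IPA formulation is not reproduced in this summary, but the two-branch, grouped design can be sketched in NumPy. In this illustrative stand-in, channel groups are split between a global branch (a softmax-weighted context vector, in the spirit of efficient attention) and a local branch (a moving average standing in for convolutional attention); all function names and the moving-average kernel are assumptions, not the paper's modules.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def efficient_global_branch(x):
    # x: (N, C) tokens. Aggregate a single global context vector from
    # softmax-weighted tokens and broadcast it back (illustrative stand-in
    # for the paper's efficient attention).
    weights = softmax(x.sum(axis=1))   # (N,) token importance
    context = weights @ x              # (C,) global context
    return x + context                 # broadcast global info to every token

def local_conv_branch(x, k=3):
    # Moving average over the token axis as a stand-in for convolutional
    # attention capturing local neighborhoods.
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)), mode="edge")
    return np.stack([xp[i:i + k].mean(axis=0) for i in range(len(x))])

def ipa_sketch(x, groups=4):
    # Split channels into groups (cheaper per-branch width), route the
    # groups through the two parallel branches, then concatenate.
    chunks = np.split(x, groups, axis=1)
    out = [efficient_global_branch(c) if i % 2 == 0 else local_conv_branch(c)
           for i, c in enumerate(chunks)]
    return np.concatenate(out, axis=1)

tokens = np.random.rand(16, 8)     # 16 spatial tokens, 8 channels
out = ipa_sketch(tokens, groups=4)
print(out.shape)                   # (16, 8): shape preserved
```

The key property the sketch preserves is that the grouped branches operate on narrower channel slices, which is where the parameter and compute savings come from.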
Furthermore, the MSCCR module is built around spatial and channel reconstruction convolution (SCConv) to reduce computational redundancy and encourage representative feature learning. By employing SCConv, MSCCR effectively reduced the parameter count to roughly a fifth of that of a standard convolution. Integrating efficient multiscale attention (EMA) into MSCCR added multiscale spatial information without introducing additional parameters.
In building C3_MR, the researchers replaced the bottleneck residual module in the YOLOv5 backbone network with MSCCR. This replacement addressed the loss of feature information while reducing parameters; parameter comparisons showed MSCCR to be approximately 1.8 times smaller than the bottleneck residual module, improving the backbone network's efficiency.
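The summary does not give SCConv's exact parameter formula, but the general effect of splitting a convolution along spatial and channel dimensions is easy to verify with back-of-envelope arithmetic. The sketch below compares a standard 3x3 convolution against a depthwise-plus-pointwise factorization, the kind of spatial/channel split such modules exploit; the resulting ratio is illustrative and will not match the paper's exact one-fifth or 1.8x figures.

```python
def conv_params(c_in, c_out, k):
    # Weight count of a standard k x k convolution (bias ignored).
    return c_in * c_out * k * k

def factorized_params(c_in, c_out, k):
    # Depthwise k x k (spatial) followed by 1 x 1 pointwise (channel):
    # each filter sees either space or channels, never both at once.
    return c_in * k * k + c_in * c_out

c_in = c_out = 64
std = conv_params(c_in, c_out, 3)          # 64 * 64 * 9 = 36864
split = factorized_params(c_in, c_out, 3)  # 576 + 4096  = 4672
print(std, split, round(std / split, 1))   # 36864 4672 7.9
```

The savings grow with channel width, which is why such factorizations pay off most in the deeper, wider stages of a backbone.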
Advancements in Vehicle Detection
The study utilized the UA-DETRAC multi-object detection and tracking benchmark dataset, comprising surveillance videos from various locations and weather conditions, with 8250 vehicles and 1.21 million labeled objects. The researchers performed frame extraction to streamline the dataset and remove redundancy, producing a new training and validation set.
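Because consecutive surveillance frames are nearly identical, thinning them out is typically done by keeping every n-th frame. A minimal sketch of that idea follows; the stride value is a hypothetical choice, as the paper's actual sampling interval is not given in this summary.

```python
def sample_frames(frames, stride=10):
    # Keep every `stride`-th frame to thin out near-duplicate consecutive
    # frames in surveillance video (stride=10 is a hypothetical value).
    return frames[::stride]

frames = list(range(100))            # stand-in for 100 video frames
kept = sample_frames(frames, stride=10)
print(len(kept), kept[:3])           # 10 [0, 10, 20]
```

Stride-based sampling preserves temporal coverage of the video while sharply reducing the number of near-duplicate training images.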
Experimental equipment included an Ubuntu 20.04 long-term support (LTS) operating system with an Intel Xeon Gold 6330 CPU, 128 GB of random access memory (RAM), and an NVIDIA RTX 3090 graphics processing unit (GPU) with 24 GB of VRAM. The researchers employed PyTorch 1.10.1 with CUDA 11.8 as the deep learning framework, with a batch size of 32 and 100 training epochs.
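The reported setup can be captured as a small configuration fragment. Only the values stated in the summary are included; optimizer, learning rate, and input resolution are not reported here and are deliberately omitted rather than guessed.

```python
# Training configuration as reported in the summary; optimizer and
# learning-rate details are not given in the source, so they are omitted.
train_config = {
    "framework": "PyTorch 1.10.1",
    "cuda": "11.8",
    "batch_size": 32,
    "epochs": 100,
    "gpu": "NVIDIA RTX 3090 (24 GB)",
}
print(train_config["batch_size"], train_config["epochs"])  # 32 100
```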
The evaluation focused on mean average precision (mAP@50) and model parameter count to assess the improved YOLOv5s' performance. The researchers conducted a comparative analysis against the faster region-based convolutional neural network (Faster R-CNN) and the single-shot multibox detector (SSD), which indicated superior accuracy and fewer parameters for the enhanced YOLOv5s algorithm.
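The "@50" in mAP@50 refers to the intersection-over-union (IoU) threshold: a detection counts as correct when its box overlaps a same-class ground-truth box with IoU of at least 0.5. A minimal IoU check illustrates the matching criterion (the boxes below are made-up example coordinates, not data from the paper):

```python
def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# mAP@50 counts a detection as a true positive when IoU >= 0.5
# with a ground-truth box of the same class.
pred = (10, 10, 50, 50)
gt = (12, 12, 52, 52)
print(round(iou(pred, gt), 3), iou(pred, gt) >= 0.5)  # 0.822 True
```

mAP@50:95, also reported in the comparison below, averages the same computation over IoU thresholds from 0.5 to 0.95, rewarding tighter localization.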
Compared with popular networks such as MobileNet version 2 (MobileNetV2), MobileNetV3, and EfficientNet, the improved backbone network demonstrated higher mAP@50 and mAP@50:95 while maintaining a similar parameter count. Furthermore, comparisons with YOLOv3-tiny, YOLOv4-tiny, and the original YOLOv5s model showed improved accuracy and a reduced parameter count for the enhanced algorithm.
Visual results and gradient-weighted class activation mapping (Grad-CAM) visualizations depicted the enhanced model's superior adaptability and feature extraction capabilities, especially in complex environments. Ablation experiments further validated the effectiveness of the proposed improvements, highlighting enhanced accuracy without increasing model parameters.
Conclusion
To sum up, the study integrated the IPA and MSCCR modules into the YOLOv5s framework to create a lightweight vehicle detection model, aiming to address existing algorithms' complexity and hardware demands. Experimentally, the enhanced algorithm exhibited a 3.1% average precision increase on the UA-DETRAC dataset compared to YOLOv5s, outperforming SSD and Faster R-CNN with a 3.3% higher mAP@50 each.
Additionally, it surpassed other backbone networks, achieving a 5.6% to 6.7% higher mAP@50 than MobileNetV2, MobileNetV3, and EfficientNet, and demonstrated a 5.7% to 6.7% higher mAP@50 than YOLOv3-tiny and YOLOv4-tiny. The model proved effective across various scenarios, improving accuracy while reducing computational costs and thereby opening the door to deployment on resource-constrained devices. Future endeavors could focus on practical implementations in embedded devices, further refining the algorithm for real-world applications.
Article Revisions
- Jun 26 2024 - Fixed broken journal paper link.