In an article published in the journal Scientific Reports, researchers addressed the persistent issue of wildlife roadkill, proposing a cost-effective solution based on machine learning (ML) detection systems. Focusing on endangered Brazilian animal species, the authors evaluated the performance of various you-only-look-once (YOLO)-based object detection models trained with limited data.
Background
Escalating human-wildlife conflict, exacerbated by factors such as climate change and urban development, has intensified the problem of wildlife roadkill worldwide. Road deaths of various species are on the rise, particularly in Brazil, where small-sized animals constitute 90% of the victims. Despite these alarming statistics and the threat to endangered species, existing road infrastructure and technological solutions for automatic animal detection have fallen short.
This study addressed the challenge of automatically detecting and classifying road-killed animals using computer vision, given the scarcity of target-domain training data. The authors focused on Brazilian highways, where roadkill incidents are alarmingly common. Previous works have applied computer vision to wildlife monitoring, yet the lack of consolidated datasets and the difficulty of training on small datasets persist. The authors built on this work by comprehensively evaluating state-of-the-art YOLO-based detectors, considering performance metrics, image quality, and the specific challenges of detecting animals on roads.
To bridge the gaps in existing research, the researchers proposed data augmentation and transfer learning techniques to enhance model training with limited data. The evaluation included metrics such as mean average precision at a 0.5 intersection-over-union threshold (mAP@50), precision, recall, and frames per second (FPS), providing valuable insights into the suitability of different detectors for real-world deployment, especially on edge or mobile devices with limited resources.
The YOLO architecture
The YOLO architecture, introduced in 2015, revolutionized real-time object detection by dividing images into grid cells and using non-maximum suppression to select precise bounding boxes. Over successive iterations, YOLO was enhanced: YOLOv2 incorporated batch normalization, higher-resolution layers, and anchor boxes, while YOLOv3 introduced logistic classifiers, the Darknet-53 backbone, and multi-scale bounding box predictions. Subsequently, YOLOv4 improved performance through the bag-of-freebies and bag-of-specials techniques, while YOLOv5, implemented in PyTorch, offered multiple model sizes for varied processing-power demands. Scaled-YOLOv4, an improvement over YOLOv4, used cross-stage partial networks for better performance in detecting large objects.
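The non-maximum suppression step mentioned above is shared across the whole YOLO family: overlapping candidate boxes for the same object are pruned so that only the most confident survives. The following is a minimal illustrative sketch in Python, not code from any YOLO implementation; the 0.45 overlap threshold is an assumed typical value.

```python
# Minimal sketch of non-maximum suppression (NMS); illustrative only.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def nms(boxes, scores, iou_threshold=0.45):
    """Keep the highest-scoring boxes, dropping overlapping duplicates."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Suppress remaining candidates that overlap the kept box too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

# Two near-duplicate detections of one animal: only the stronger survives.
print(nms([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]],
          [0.9, 0.8, 0.7]))  # -> [0, 2]
```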
Further innovations included YOLOR, which emphasized implicit and explicit learning; YOLOX, which dropped anchors and introduced SimOTA for dynamic label assignment during training; and YOLOv7, which achieved superior speed and accuracy. YOLOv7 shortened gradient propagation paths, incorporated an extended efficient layer aggregation network, and introduced a coarse-to-fine auxiliary head to enhance predictions during training. YOLOv7 outperformed various real-time detectors, reaching 56.8% average precision (AP) on the Microsoft Common Objects in Context (MS COCO) validation set.
Methodology
The research methodology involved three key steps in evaluating the YOLO-based object detection models. In step one, each model underwent two types of training: first on 80% of the Brazilian road animals dataset (BRA-Dataset) without data augmentation, and then on the same split with augmentation techniques such as horizontal/vertical shifts, flips, and rotations. Data augmentation served as regularization to prevent overfitting and improve generalization, and transfer learning from pre-trained models was employed throughout, as sketched below.
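The paper does not name an augmentation library, but bounding-box-aware transformations of the kind described (shifts, flips, rotations) can be sketched with albumentations, one common choice; all limits and the sample box below are illustrative.

```python
# Sketch of box-aware augmentation (shifts, flips, rotations); the
# library choice and all parameter values are assumptions, not the
# authors' actual pipeline.
import numpy as np
import albumentations as A

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.0,
                           rotate_limit=30, p=0.5),
    ],
    # YOLO-format labels: normalized (x_center, y_center, width, height).
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# Dummy image and one centered box standing in for a dataset sample.
image = np.zeros((640, 640, 3), dtype=np.uint8)
augmented = transform(image=image,
                      bboxes=[(0.5, 0.5, 0.2, 0.2)],
                      class_labels=[0])
aug_image, aug_boxes = augmented["image"], augmented["bboxes"]
```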
In step two, the trained models were tested on the BRA-Dataset validation set as well as on videos recorded in the ecological park of São Carlos, Brazil, and freely available internet videos. The tests assessed inference speed on both a graphics processing unit (GPU) and edge devices, providing insights into real-world deployment scenarios.
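Inference-speed tests of this kind typically time a detector over every frame of a video and report the average FPS. A minimal sketch follows; the video file name is a placeholder, and torch.hub is just one way to obtain a YOLOv5 Nano model, not necessarily the authors' setup.

```python
# Sketch of per-model FPS measurement on a video; file name and model
# loading are illustrative assumptions.
import time
import cv2
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5n")  # YOLOv5 Nano

cap = cv2.VideoCapture("park_recording.mp4")  # placeholder video file
frames, start = 0, time.perf_counter()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR
    _ = model(rgb)  # run detection; output ignored for timing purposes
    frames += 1
elapsed = time.perf_counter() - start
print(f"Average FPS: {frames / max(elapsed, 1e-9):.1f}")
```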
Step three involved comparing the results, emphasizing precision, recall, and mAP metrics on the BRA-Dataset. A qualitative analysis of the videos evaluated model performance in challenging scenarios such as occlusion, distant objects, and poor image quality, aiming to provide a comprehensive understanding of the models' capabilities beyond standard quantitative metrics.
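For context, the reported mAP@50 averages, over all classes, the average precision computed at an intersection-over-union (IoU) threshold of 0.5. Below is a minimal single-class sketch of that computation, illustrative rather than the authors' evaluation code.

```python
# Single-class AP@50 sketch; mAP@50 is this value averaged over classes.
import numpy as np

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def ap_at_50(detections, gt_boxes):
    """detections: (confidence, box) pairs; gt_boxes: ground-truth boxes."""
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    matched, tp = set(), []
    for _, box in detections:
        # Match each detection to the best still-unmatched ground truth.
        ious = [iou(box, g) if j not in matched else 0.0
                for j, g in enumerate(gt_boxes)]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= 0.5:  # the "@50" IoU threshold
            matched.add(j)
            tp.append(1)
        else:
            tp.append(0)
    cum_tp = np.cumsum(tp)
    precision = cum_tp / (np.arange(len(tp)) + 1)
    recall = cum_tp / max(len(gt_boxes), 1)
    # AP: area under the precision-recall curve (rectangular sum).
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

Precision and recall as reported in such studies correspond to single operating points on this curve, taken at a model's chosen confidence threshold.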
The BRA-Dataset, featuring Brazilian fauna species vulnerable to road accidents, was utilized for training and testing. With 1823 images across five classes, the dataset was diverse and labeled in YOLO Darknet and Pascal visual object classes (VOC) formats.
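The two label formats differ only in how boxes are encoded: YOLO Darknet stores normalized (x_center, y_center, width, height), while Pascal VOC stores absolute (xmin, ymin, xmax, ymax) pixel corners. A minimal conversion sketch:

```python
# Convert one normalized YOLO box to absolute Pascal VOC corners.
def yolo_to_voc(xc, yc, w, h, img_w, img_h):
    xmin = (xc - w / 2) * img_w
    ymin = (yc - h / 2) * img_h
    xmax = (xc + w / 2) * img_w
    ymax = (yc + h / 2) * img_h
    return xmin, ymin, xmax, ymax

# Example: a centered box covering 20% of each side of a 640x480 image.
print(yolo_to_voc(0.5, 0.5, 0.2, 0.2, 640, 480))
# -> (256.0, 192.0, 384.0, 288.0)
```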
Results
The researchers evaluated the YOLO-based object detection models for animal detection on both the BRA-Dataset validation set and video recordings. Models were trained with and without data augmentation on the BRA-Dataset, a collection of images featuring Brazilian fauna vulnerable to road accidents. Results on the validation set indicated potential overfitting in the models trained without data augmentation, as suggested by exceptionally high precision or recall values. Models trained with augmentation showed more realistic metrics, with Scaled-YOLOv4-p5 demonstrating the best overall recall.
YOLOv7 performed less effectively, exhibiting lower recall values. The video-based evaluation revealed trade-offs between model complexity, accuracy, and speed. YOLOv5 Nano achieved the highest average FPS on a dedicated GPU, making it suitable for mobile and edge devices; on a Raspberry Pi, however, heavier models were hindered by memory constraints. Qualitative analysis of the videos highlighted challenges such as occlusion, distant objects, and camouflage, where even complex models struggled. Despite these limitations, the study provided insights into the trade-offs involved in deploying YOLO-based models in real-world scenarios, emphasizing the need to balance accuracy, speed, and complexity.
Conclusion
The study compared YOLO architectures for highway animal detection, finding that Scaled-YOLOv4 excelled at mitigating false negatives. YOLOv4 and YOLOv5 demonstrated strong overall performance, with YOLOv5 Nano leading in FPS for video inference. Data augmentation proved effective for training, improving metrics in all models except YOLOv7.
Challenges included limited dataset variation and classic computer vision issues. Future work involves reassessing the dataset, exploring new augmentation techniques, testing on edge computing devices, and evaluating the models in diverse occlusion scenarios globally. Despite these challenges, YOLO architectures presented viable solutions for real-time animal detection on highways.
Journal reference:
- Ferrante, G. S., Vasconcelos Nakamura, L. H., Sampaio, S., Filho, G. P. R., & Meneguette, R. I. (2024). Evaluating YOLO architectures for detecting road killed endangered Brazilian animals. Scientific Reports, 14(1), 1353. https://doi.org/10.1038/s41598-024-52054-y