In a recent article published in the journal Sensors, researchers introduced a novel method for estimating the dimensions of road vehicles. The approach combines a monovision sensor with road geometry information, using object detection and core vectors to avoid the cost, calibration complexity, and maintenance burden of multi-sensor setups, thereby offering a cost-effective solution.
Background
The rapid advancement of information and communication technology has spurred cities globally to embrace the smart city concept, employing intelligent platforms to enhance various facets of urban life. Notably, intelligent transportation has undergone transformative changes, aiming to autonomously gather vital information about vehicle speeds, positions, and types through vision sensors. These data are instrumental in enabling intelligent traffic signal control, analyzing traffic risk behavior, predicting traffic conditions, and more. Consequently, there is an increasing emphasis on utilizing vision sensors to track, detect, and identify road users.
However, object detection remains a fundamental challenge in computer vision. Deep learning models, particularly convolutional neural networks (CNNs), have recently become the dominant approach, with the You Only Look Once (YOLO) series prevalent for real-time detection. Recent research focuses on precisely estimating 3D bounding boxes for vehicles within their environment, which is crucial for applications such as collision avoidance and tracking.
Traditionally, estimating 3D bounding boxes relied on multi-sensor setups, introducing cost, calibration complexity, and maintenance challenges. This has driven the demand for 3D object detection using single-vision sensors. However, a key challenge remains the need for high-quality labeled datasets. Efficient and computationally streamlined techniques are essential to enhance single-vision sensor-based 3D object detection. In this study, researchers introduce the "Vehiclectron" model, which precisely estimates 3D vehicle bounding boxes using road geometry information and monovision sensors.
Monovision sensor-based 3D vehicle detection
Data Sources: The researchers obtained road vehicle images from the AI-Hub platform, which provides access to various data types, including text and vision data. The dataset comprises 117,926 images containing 724,196 road vehicles across several classes: trucks, buses, general cars, bikes, and unknown objects, where the unknown category covers unique or small-sized objects and those obscured in images. Each vehicle carries a cuboid annotation represented by eight pairs of coordinates. Annotated regions of interest (RoIs) designate the portions of the scene that capture traffic flow and facilitate flow vector calculation.
Proposed Model: The proposed model comprises three main components: object detection, core vector extraction, and cuboid estimation. For object detection, deep learning algorithms such as YOLOv7 are employed to identify road vehicles in 2D images. YOLOv7 boasts improvements in detector heads and backbone networks, enhancing real-time object detection accuracy without significant computational costs. This model provides class information and bounding box coordinates for detected objects, forming the basis for core vector extraction.
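As a minimal illustration of the detector's role in the pipeline (the class and field names here are hypothetical, not the paper's actual interface), each detection can be represented as a class label plus 2D box corners, with the bottom-center of the box serving as a plausible road-plane anchor for the later cuboid steps:

```python
from dataclasses import dataclass

# Hypothetical container for one 2D detection (class label + box),
# mirroring the kind of output a detector like YOLOv7 produces.
@dataclass
class Detection:
    label: str   # e.g. "truck", "bus", "car", "bike"
    x1: float    # top-left corner of the 2D bounding box
    y1: float
    x2: float    # bottom-right corner
    y2: float

    def bottom_center(self) -> tuple:
        """Point where the vehicle meets the road plane; a natural
        anchor for projecting core vectors onto the image."""
        return ((self.x1 + self.x2) / 2.0, self.y2)

det = Detection("car", 100, 50, 180, 130)
print(det.bottom_center())  # (140.0, 130)
```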
In the core vector representation phase, three vectors are estimated for road vehicle cuboid estimation: the flow vector, the orthogonal vector, and the perpendicular vector. The flow vector encodes traffic flow and road geometry under the assumption that vehicles follow their lanes. Within the RoI, lane lines are detected using the Hough transform, which maps image points to curves in Hough space; the highest-voted pattern in the accumulator array represents the lane direction. Multiple lane lines are obtained within the RoI, and their direction vectors are averaged to determine the flow vector, which signifies the vehicle's front direction.
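A minimal sketch of the averaging step, assuming lane line segments have already been extracted (for example, with OpenCV's probabilistic Hough transform); the segment format and function name are illustrative, not the paper's implementation:

```python
import math

def flow_vector(lane_segments):
    """Average the unit direction vectors of detected lane segments
    (each given as (x1, y1, x2, y2)) to estimate the flow vector,
    i.e., the assumed heading of vehicles in the image."""
    sx = sy = 0.0
    for x1, y1, x2, y2 in lane_segments:
        dx, dy = x2 - x1, y2 - y1
        n = math.hypot(dx, dy)
        if n:  # skip degenerate zero-length segments
            sx += dx / n
            sy += dy / n
    m = math.hypot(sx, sy)
    return (sx / m, sy / m)  # unit flow vector

# Two parallel lane lines running up-and-right at 45 degrees:
vx, vy = flow_vector([(0, 100, 100, 0), (10, 110, 110, 10)])
print(round(vx, 3), round(vy, 3))  # 0.707 -0.707
```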
After the flow vector is obtained, the orthogonal and perpendicular vectors are estimated. The orthogonal vector is perpendicular to the flow vector in the image plane, forming the basis for the cuboid's orientation. The perpendicular vector extends vertically downward through the vehicle along the z-axis of the 2D image, capturing its height. The final cuboid is estimated from the object detection outcomes and the extracted core vectors.
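The vector relationships above can be sketched in a few lines; this is an illustrative simplification in image coordinates (where y grows downward), not the paper's implementation:

```python
def orthogonal_vector(flow):
    """Rotate the flow vector 90 degrees in the image plane; this
    direction spans the vehicle's width for the cuboid base."""
    fx, fy = flow
    return (-fy, fx)

# The perpendicular vector points straight down in image coordinates
# (increasing y), modeling the vehicle's vertical extent.
PERPENDICULAR = (0.0, 1.0)

flow = (0.0, 1.0)  # a vehicle heading straight down the frame
orth = orthogonal_vector(flow)
print(orth)                               # (-1.0, 0.0)
print(flow[0]*orth[0] + flow[1]*orth[1])  # 0.0 (perpendicular, as required)
```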
Experimental objectives and 3D evaluation criteria
In this study, two key experiments are undertaken with distinct aims: the first identifies road vehicles and their 2D bounding boxes using diverse object detection models to determine the optimal choice, and the second assesses the accuracy of the cuboid estimates produced via core vectors, using 3D intersection over union (IoU) as the evaluative yardstick.
In 2D computer vision, the IoU is a standard measure to gauge object detection and segmentation accuracy. It quantifies the spatial overlap between predicted and actual regions, offering localization and segmentation insight. Calculated as the intersection area over the union area, it ranges between zero (no overlap) and one (perfect match).
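The 2D computation described above amounts to a few lines; a standard sketch for axis-aligned boxes given as (x1, y1, x2, y2) corners:

```python
def iou_2d(box_a, box_b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)

print(round(iou_2d((0, 0, 2, 2), (1, 1, 3, 3)), 4))  # 0.1429 (i.e., 1/7)
```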
The IoU extends naturally to a 3D context, referred to as 3D IoU, which accounts for the volumes of the cuboids rather than areas. The average 3D IoU measures the spatial alignment between estimated and ground-truth cuboids, providing a comprehensive 3D evaluation. The results validate the effectiveness of YOLOv7 for road vehicle detection, and the estimated cuboids also perform well. While the proposed model, relying on object detection and core vectors, shows promise, the researchers acknowledge its limitations, particularly its dependence on geometric information and detection quality. Although some instances pose challenges, the approach pioneers the use of monovision sensors for 3D cuboid estimation. This innovation holds potential for stationary video sensors such as closed-circuit televisions (CCTVs), enhancing traffic flow management by providing accurate vehicle coordinates and dimensions.
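For intuition, the 3D IoU can be sketched for axis-aligned cuboids given as (x1, y1, z1, x2, y2, z2) corners; note this is a simplification, since oriented vehicle cuboids require rotated-box intersection:

```python
def iou_3d(a, b):
    """3D IoU of two axis-aligned cuboids, each (x1, y1, z1, x2, y2, z2)."""
    inter = 1.0
    for i in range(3):  # overlap length along the x, y, z axes
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        inter *= max(0.0, overlap)
    volume = lambda c: (c[3] - c[0]) * (c[4] - c[1]) * (c[5] - c[2])
    return inter / (volume(a) + volume(b) - inter)

print(round(iou_3d((0, 0, 0, 2, 2, 2), (1, 1, 1, 3, 3, 3)), 4))  # 0.0667 (1/15)
```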
Conclusion
In summary, the proposed Vehiclectron model presents a valuable contribution by overcoming traditional limitations and demonstrating the potential of monovision sensor-based 3D object detection. It offers a practical and cost-effective solution, confirmed through real-world applications using CCTV footage from multiple roads.