In an article recently published in the journal Applied Sciences, researchers proposed Swin-APT for image semantic segmentation and object detection tasks in intelligent transportation systems (ITSs).
Background
ITSs increasingly incorporate technologies such as artificial intelligence (AI) and the Internet of Things (IoT) to provide traffic information services based on real-time traffic data. AI has been used extensively in ITSs, as it can reduce human involvement while maintaining high accuracy. Pedestrians and vehicles are crucial elements of the dynamic, complex road environment in urban traffic networks, and object detection and semantic segmentation tailored to smart transportation extract the information an ITS needs from raw image data.
The trajectories of both pedestrians and vehicles can be derived from detection and segmentation results, enabling the inference of potential safety hazards. Images contain a substantial amount of underlying semantic information, and computer vision, which plays a crucial role in ITSs, helps intelligent vehicles understand scene semantics.
Although existing algorithms can analyze complex scenes through object detection and semantic segmentation independently, they process the two tasks sequentially, incurring unnecessary time costs. In autonomous driving scenes, integrating the requirements of several tasks into a unified model enables effective information sharing among the tasks, improving the performance of the overall perception system. Moreover, in practical applications such as traffic control and autonomous vehicles, models must be accurate while also meeting real-time performance and computational efficiency requirements.
The proposed approach
In this study, researchers proposed Swin-APT, a deep learning (DL)-based approach for semantic segmentation and object detection in ITSs. The study's objective was to use DL-based algorithms for scene understanding and to produce segmentation predictions on traffic lane datasets to assist in road condition analysis.
Swin-APT incorporated a lightweight Swin-Transformer-based network and a multiscale adapter network designed for object detection and image semantic segmentation. This design improved prediction accuracy while keeping the computational cost small.
Additionally, an inter-frame consistency module was proposed to extract more accurate road information from images: it used contrastive learning to measure information consistency between adjacent image frames. The adapter network operated in the multiscale feature space so that scene objects of various scales could be recognized effectively in the downstream tasks.
In the Swin-APT architecture, the encoder was composed of four consecutive Swin-Transformer blocks that, together with the proposed adapter network, formed a feature pyramid structure encoding the images into high-level semantic features.
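To make this data flow concrete, the sketch below outlines a hierarchical encoder of this shape in PyTorch. It is a minimal illustration, not the authors' implementation: the stage widths, the stand-in attention block, and the module names (SwinStage, MultiscaleAdapter) are all assumptions, and a real Swin block would use shifted-window attention rather than a plain transformer layer.

```python
import torch
import torch.nn as nn

class SwinStage(nn.Module):
    """Stand-in for one Swin-Transformer stage: patch merging is a strided
    conv, and window attention is approximated by a transformer layer."""
    def __init__(self, in_ch, out_ch, downsample):
        super().__init__()
        self.patch_merge = nn.Conv2d(in_ch, out_ch,
                                     kernel_size=downsample, stride=downsample)
        self.block = nn.TransformerEncoderLayer(d_model=out_ch, nhead=4,
                                                batch_first=True)

    def forward(self, x):
        x = self.patch_merge(x)                 # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        tokens = self.block(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

class MultiscaleAdapter(nn.Module):
    """Projects each stage output to a common width, giving an
    FPN-style feature pyramid."""
    def __init__(self, chans, width=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, width, 1) for c in chans)

    def forward(self, feats):
        return [p(f) for p, f in zip(self.proj, feats)]

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        chans, downs, in_ch = [64, 128, 256, 512], [4, 2, 2, 2], 3
        self.stages = nn.ModuleList()
        for c, d in zip(chans, downs):
            self.stages.append(SwinStage(in_ch, c, d))
            in_ch = c
        self.adapter = MultiscaleAdapter(chans)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return self.adapter(feats)   # four maps at strides 4, 8, 16, 32

if __name__ == "__main__":
    pyramid = Encoder()(torch.randn(1, 3, 256, 256))
    print([tuple(f.shape) for f in pyramid])
```

The four pyramid levels then serve as the high-level semantic features consumed by the later modules.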
Subsequently, these high-level semantic features were fed into the inter-frame consistency module, which learned consistent information across two consecutive frames processed in parallel to encode the images' semantic meaning. Finally, the image features were passed through task-specific heads for road marking detection and road segmentation.
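The article does not spell out the contrastive objective, but a common way to encourage inter-frame consistency is an InfoNCE-style loss over pooled features of adjacent frames. The following sketch assumes global average pooling, a temperature of 0.1, and in-batch negatives; none of these choices are confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def interframe_consistency_loss(feat_t, feat_t1, temperature=0.1):
    """InfoNCE-style loss: features from the same scene in adjacent frames
    are positives; other samples in the batch act as negatives.
    feat_t, feat_t1: (B, C, H, W) encoder outputs for frames t and t+1."""
    z_t = F.normalize(feat_t.mean(dim=(2, 3)), dim=1)    # (B, C) pooled
    z_t1 = F.normalize(feat_t1.mean(dim=(2, 3)), dim=1)
    logits = z_t @ z_t1.t() / temperature                # (B, B) similarities
    targets = torch.arange(z_t.size(0), device=z_t.device)
    return F.cross_entropy(logits, targets)

# Hypothetical usage during training, alongside the task losses:
# loss = seg_loss + det_loss + 0.1 * interframe_consistency_loss(f_t, f_t1)
```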
Experimental evaluation and findings
Researchers performed extensive experiments on four public datasets, the CeyMo road marking detection dataset and three road semantic segmentation datasets, BDD100K, CamVid, and SYNTHIA, to validate the proposed approach and to find a balance between computational cost and accuracy.
Mean intersection over union (mIoU) and accuracy were used as evaluation metrics for the road segmentation task, while mean average precision (mAP) was used for the road marking detection task. Experiments on the road segmentation datasets demonstrated that Swin-APT was a feasible and effective approach compared with the existing models employed as baselines in this study.
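For reference, mIoU averages the per-class overlap between predicted and ground-truth label maps. The minimal implementation below is generic, not the paper's evaluation code; skipping classes absent from both maps is one common convention among several.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """pred, target: integer class-label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:          # class absent from both maps; skip it
            continue
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

# Example: two 2x2 label maps over 2 classes
pred = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
print(mean_iou(pred, target, num_classes=2))  # (0.5 + 2/3) / 2 ≈ 0.583
```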
Swin-APT achieved the highest mIoU of 91.2% on the BDD100K dataset, outperforming all other methods, including the recent state-of-the-art model A-YOLOM as well as HybridNets, PSPNet, YOLOv8n(seg), DLT-Net, and MultiNet. This made Swin-APT the best-performing road segmentation model on the BDD100K benchmark.
On the CamVid benchmark dataset, Swin-APT was the second-best model with an mIoU of 81.3%. It outperformed DFANet A, DenseDecoder, VideoGCRF, and ETC-Mobile, and performed only slightly below the best model, DeepLabV3Plus + SDCNetAug, which achieved 81.7% mIoU. These results indicated the versatility of Swin-APT and its effectiveness across different real-world applications. Similarly, on the synthetic SYNTHIA dataset, Swin-APT consistently outperformed its variants in all individual classes, including "Cyclist", "Pedestrian", "Vegetation", "Car", "Sky", "Road", and "Building".
Overall, Swin-APT achieved an improvement of up to 13.1% mIoU over the baseline models. Additionally, road marking detection experiments on the CeyMo dataset showed an improvement of 1.85% mAP over the baseline model.
Journal reference:
- Liu, Y., Wu, C., Zeng, Y., Chen, K., & Zhou, S. (2023). Swin-APT: An Enhancing Swin-Transformer Adaptor for Intelligent Transportation. Applied Sciences, 13(24), 13226. https://doi.org/10.3390/app132413226, https://www.mdpi.com/2076-3417/13/24/13226