In autonomous driving, semantic segmentation has evolved from sparse point-based methods toward dense voxel-based ones, with the goal of predicting semantic occupancy throughout 3D space. Existing 2D-projection methods fall short because projecting onto a single plane discards 3D structural information. In a recent paper submitted to the arXiv* server, researchers introduced PointOcc, an efficient point-based model for 3D semantic occupancy prediction that represents the scene with three 2D projections, a cylindrical tri-perspective view.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Accurate 3D environmental perception is crucial for autonomous driving, and light detection and ranging (LiDAR) dominates sensor choices because it directly captures 3D structural information. LiDAR-based models excel at 3D object detection, semantic segmentation, and object tracking. LiDAR semantic segmentation assigns a category to each point, typically via voxelization or 2D projections of point clouds. However, 2D projections often fall short due to information loss and the need for post-processing, and sparse point cloud segmentation cannot describe a scene comprehensively, which has led to 3D semantic occupancy prediction emerging as a more challenging alternative.
Revolutionizing LiDAR Segmentation: The TPV Advantage
The fundamental task in LiDAR-based semantic perception is to assign semantic labels to individual points in LiDAR point clouds. State-of-the-art methods use 3D voxel grids and convolutional networks, but this approach demands substantial computation and storage. Alternatively, 2D-projection-based methods project point clouds onto 2D planes, reducing computational demands but losing structural information. The current study adopts the tri-perspective view (TPV) representation, which captures 3D structures with three orthogonal 2D planes, and proposes a new cylindrical TPV variant tailored to LiDAR point clouds to reduce information loss. For 3D occupancy prediction, which requires comprehensive scene understanding, voxel-based methods prevail but remain resource-intensive.
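The core idea of TPV can be sketched in a few lines: features live on three orthogonal 2D planes, and the feature of any 3D location is recovered by sampling each plane at the corresponding projected coordinates and summing the three contributions. The grid sizes, channel count, and nearest-neighbor sampling below are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Hypothetical TPV grid: H x W (top-down), D x H (side), W x D (front),
# each storing C-dimensional features.
H, W, D, C = 8, 8, 4, 16
tpv_hw = np.random.rand(H, W, C)  # top-down plane
tpv_dh = np.random.rand(D, H, C)  # side plane
tpv_wd = np.random.rand(W, D, C)  # front plane

def tpv_feature(h, w, d):
    """Recover a voxel feature by sampling each plane (nearest indexing)
    at the voxel's projection onto it and summing the three results."""
    return tpv_hw[h, w] + tpv_dh[d, h] + tpv_wd[w, d]

feat = tpv_feature(2, 5, 1)
print(feat.shape)  # (16,)
```

Storing three 2D planes costs O(HW + DH + WD) memory instead of O(HWD) for a dense voxel grid, which is where the efficiency gain comes from.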
Efficient Representation for Point Clouds
The authors presented an efficient approach for point cloud processing, particularly in the context of 3D semantic occupancy prediction and LiDAR segmentation. Traditional methods employ dense voxel representations for 3D scene descriptions, but their computational demands restrict the achievable resolution. Conversely, 2D-projection-based methods, such as range views, reduce complexity but lose radial information, rendering them unsuitable for dense prediction tasks.
In response, the authors propose PointOcc, introducing the TPV concept to point cloud perception. This innovation preserves the ability to model complex 3D scenes while mitigating computational and storage complexity. PointOcc's architecture comprises three components: a LiDAR projector, TPV encoder-decoder, and a task-specific head.
PointOcc uses cylindrical partition and spatial pooling to convert point clouds into cylindrical TPV inputs: three mutually perpendicular 2D planes over which points are distributed evenly. These TPV planes are then processed by a 2D backbone and a feature pyramid network (FPN), yielding TPV features that can be transformed into point and voxel features in 3D space. A task-specific head then predicts semantic labels for both dense voxel prediction and point-wise LiDAR segmentation.
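The pipeline above can be approximated as follows: convert Cartesian points to cylindrical coordinates (rho, phi, z), voxelize them, and collapse each axis by max-pooling to obtain the three planes. The function name, grid sizes, and the use of plain max-pooling over a dense scatter are simplifying assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def cylindrical_tpv(points, feats, grid=(32, 32, 16)):
    """Sketch: build three cylindrical TPV planes from an (N, 3) point
    cloud and (N, C) point features via voxelization and max-pooling."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x**2 + y**2)     # radial distance
    phi = np.arctan2(y, x)         # azimuth angle

    # Discretize a coordinate into n bins over [lo, hi).
    def bucket(v, lo, hi, n):
        return np.clip(((v - lo) / (hi - lo) * n).astype(int), 0, n - 1)

    r_i = bucket(rho, 0.0, rho.max() + 1e-6, grid[0])
    p_i = bucket(phi, -np.pi, np.pi, grid[1])
    z_i = bucket(z, z.min(), z.max() + 1e-6, grid[2])

    # Scatter point features into a dense voxel grid (max over points
    # sharing a voxel), then pool along each axis to form the planes.
    C = feats.shape[1]
    vox = np.full(grid + (C,), -np.inf)
    for i in range(len(points)):
        vox[r_i[i], p_i[i], z_i[i]] = np.maximum(vox[r_i[i], p_i[i], z_i[i]], feats[i])
    vox[np.isinf(vox)] = 0.0
    # (rho, phi), (phi, z), and (rho, z) planes.
    return vox.max(axis=2), vox.max(axis=0), vox.max(axis=1)

pts = np.random.randn(1000, 3)
f = np.random.rand(1000, 8)
plane_rp, plane_pz, plane_rz = cylindrical_tpv(pts, f)
print(plane_rp.shape, plane_pz.shape, plane_rz.shape)
```

The cylindrical coordinates match LiDAR's radial sampling pattern, so points spread more evenly across bins than in a Cartesian grid; the resulting 2D planes can then be fed to an ordinary image backbone.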
By embracing TPV representation, PointOcc alleviates computational and storage complexity while retaining the capacity to model intricate 3D scenes. The cylindrical partition technique, coupled with spatial group pooling, allows efficient 2D processing while preserving 3D structural information. This approach significantly enhances point cloud processing efficiency, making it applicable to diverse 3D scene understanding tasks.
Experiments and Analysis of Results
The authors evaluated their method on two benchmarks: OpenOccupancy for 3D semantic occupancy prediction and Panoptic nuScenes for LiDAR segmentation. For 3D semantic occupancy prediction, the perception range spans [-51.2 m, -51.2 m, -5 m] to [51.2 m, 51.2 m, 3 m] with a voxel size of 0.2 m; TPV features are used to predict semantic labels, evaluated with mIoU and IoU. For LiDAR segmentation, the TPV planes predict a semantic label for each point, with mIoU as the evaluation metric.
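Working through the stated range and voxel size gives the evaluation grid dimensions, and the mIoU metric is a simple mean of per-class intersection-over-union. The sketch below uses an arbitrary class count for illustration; OpenOccupancy and nuScenes define their own label sets.

```python
import numpy as np

# Grid size implied by the stated range and 0.2 m voxels:
# x and y span 102.4 m, z spans 8 m.
x_dim = round((51.2 - (-51.2)) / 0.2)  # 512 voxels along x (and y)
z_dim = round((3 - (-5)) / 0.2)        # 40 voxels along z
print(x_dim, x_dim, z_dim)             # 512 512 40

def miou(pred, gt, num_classes):
    """Mean IoU over classes that appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.random.randint(0, 3, size=1000)
pred = gt.copy()
print(miou(pred, gt, 3))  # 1.0 for a perfect prediction
```

Benchmarks differ in exactly how absent classes and an ignore label are handled, so treat this as the general shape of the metric rather than a drop-in scorer.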
They employed a consistent model architecture for both tasks, combining cylindrical partition, spatial group pooling, and a 2D backbone such as the Swin Transformer (SwinT). Training used an Adam optimizer with weight decay and a cosine learning rate scheduler; for inference, voxel features were obtained and upsampled for occupancy prediction. The PointOcc model outperformed previous methods on both tasks, demonstrating its efficiency and effectiveness. The authors also examined the complementary properties of the TPV planes, spatial resolution, group size, 2D backbone initialization, and visualizations of the 3D semantic occupancy predictions.
Conclusion
In summary, the researchers introduced the cylindrical TPV representation for point-based models, enabling efficient modeling of intricate 3D structures with a 2D image backbone. The proposed cylindrical partition and spatial group pooling methods transform point clouds into TPV space while preserving structural details. Experimental results on LiDAR segmentation and occupancy prediction demonstrate PointOcc's superiority over 2D-projection-based methods and its competitiveness with voxel-based approaches. However, scalability to higher-resolution scene modeling remains a limitation, as the segmentation head still computes dense 3D features.
Journal reference:
- Preliminary scientific report.
Zuo, S., Zheng, W., Huang, Y., Zhou, J., and Lu, J. (2023). PointOcc: Cylindrical Tri-Perspective View for Point-based 3D Semantic Occupancy Prediction. arXiv. DOI: https://doi.org/10.48550/arXiv.2308.16896, https://arxiv.org/abs/2308.16896