In a recent paper submitted to the arXiv* server, researchers introduced Neural Radiance Field (NeRF)-Det, a novel approach for indoor 3D detection using posed RGB (red, green, and blue) images as input.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
The current study centers on indoor 3D object detection from posed RGB images, a vital task in computer vision applications such as robotics, augmented reality (AR), and virtual reality (VR). Most existing 3D detection approaches rely on both RGB images and depth (RGB-D) measurements. However, many AR/VR headsets and mobile phones lack depth sensors, making it challenging to recover scene geometry from RGB-only images.
To address this, the authors propose NeRF-Det, explicitly modeling scene geometry as an opacity field by jointly training a NeRF branch with the 3D detection pipeline.
Elevating indoor 3D object detection techniques
The study delves into indoor 3D object detection, where methods vary by input type, notably point clouds and voxel representations. Techniques such as 3D Semantic Instance Segmentation and VoteNet have been effective, but the lack of depth sensors on certain devices, such as VR/AR headsets, poses challenges. Panoptic3D and Cube region-based convolutional neural network (R-CNN) address this issue in different ways: the former extracts point clouds from predicted depth, while the latter directly regresses 3D bounding boxes from 2D images.
A more promising direction is the multi-view approach, which does not rely on depth sensors and offers greater accuracy. Still, the existing state-of-the-art multi-view method does not adequately incorporate geometric information. To rectify this, the authors leverage NeRF to embed geometry into the feature volume and thereby enhance 3D detection.
Fusing NeRF and 3D object detection
The method, referred to as NeRF-Det, is designed for indoor 3D object detection using posed RGB images. It extracts image features and projects them into a 3D volume, leveraging NeRF to infer scene geometry from 2D observations. To achieve this, 3D object detection and NeRF are entangled with a shared multi-layer perceptron (MLP), allowing the multi-view constraint in NeRF to enhance geometry estimation for detection.
In the 3D detection branch, RGB frames are processed through a 2D image backbone, creating a 3D feature volume by attaching 2D features to their corresponding positions in 3D space. A 3D coordinate system is established to build a 3D grid of voxels, with features projected accordingly. Multi-view features are then aggregated.
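The projection step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, nearest-neighbor sampling, and mean pooling across views are simplifying assumptions (real pipelines typically use bilinear sampling and learned aggregation).

```python
import numpy as np

def build_feature_volume(feats, intrinsics, extrinsics, voxel_centers):
    """Aggregate per-view 2D features into a 3D voxel volume (mean pooling).

    feats:         (V, H, W, C) 2D feature maps, one per posed view
    intrinsics:    (V, 3, 3) camera intrinsic matrices
    extrinsics:    (V, 4, 4) world-to-camera transforms
    voxel_centers: (N, 3) voxel center coordinates in world space
    returns:       (N, C) averaged features (zeros where no view sees a voxel)
    """
    V, H, W, C = feats.shape
    N = voxel_centers.shape[0]
    accum = np.zeros((N, C))
    count = np.zeros((N, 1))
    homog = np.concatenate([voxel_centers, np.ones((N, 1))], axis=1)  # (N, 4)
    for i in range(V):
        cam = (extrinsics[i] @ homog.T).T[:, :3]             # world -> camera frame
        in_front = cam[:, 2] > 1e-6                          # keep points ahead of the camera
        pix = (intrinsics[i] @ cam.T).T
        pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-6, None)  # perspective divide
        u = np.round(pix[:, 0]).astype(int)
        v = np.round(pix[:, 1]).astype(int)
        valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        accum[valid] += feats[i, v[valid], u[valid]]          # nearest-neighbor sampling
        count[valid] += 1
    return accum / np.clip(count, 1, None)                    # average over observing views
```

Voxels that fall outside every view (or behind every camera) simply receive zero features, which mirrors how unobserved regions of the grid stay empty.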
The NeRF branch samples features from higher-resolution 2D image feature maps and augments them with per-pixel RGB values, which serve as priors to optimize geometry estimation. A shared geometry MLP (G-MLP) predicts a density field, which is then transformed into the opacity field that models scene geometry; this shared G-MLP is what connects the two branches during both training and inference.
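The density-to-opacity transformation is, in standard NeRF volume rendering, a simple pointwise conversion: the opacity of a sample is alpha = 1 - exp(-sigma * delta), where sigma is the predicted density and delta is the spacing between adjacent samples along a ray. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def density_to_opacity(sigma, delta):
    """Standard NeRF conversion from volume density to per-sample opacity.

    sigma: array of non-negative densities predicted along a ray
    delta: distance(s) between adjacent ray samples
    returns values in [0, 1): 0 for empty space, approaching 1 as density grows
    """
    return 1.0 - np.exp(-sigma * delta)
```

Zero density yields zero opacity (fully transparent), and large densities saturate toward one, which is why the opacity field is a natural representation of scene geometry.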
Joint end-to-end training involves supervision for both detection and NeRF branches. Depth-ground truth can be optionally used during training but is not required during inference. The network is generalizable to new, unseen scenes.
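The joint objective can be sketched as a weighted sum of the detection loss, a photometric NeRF loss, and an optional depth term that is only available when depth ground truth exists at training time. The function below is an illustrative assumption, not the paper's exact formulation: the loss weights, the MSE photometric term, and the L1 depth term are all placeholders.

```python
import numpy as np

def joint_loss(det_loss, rgb_pred, rgb_gt,
               depth_pred=None, depth_gt=None,
               w_nerf=1.0, w_depth=0.1):
    """Hypothetical combined objective for joint detection + NeRF training.

    det_loss:   scalar loss from the 3D detection branch
    rgb_pred/rgb_gt:     rendered vs. ground-truth pixel colors
    depth_pred/depth_gt: optional depth supervision (training only)
    w_nerf, w_depth:     illustrative loss weights, not from the paper
    """
    nerf_loss = float(np.mean((rgb_pred - rgb_gt) ** 2))   # photometric MSE
    total = det_loss + w_nerf * nerf_loss
    if depth_pred is not None and depth_gt is not None:
        total += w_depth * float(np.mean(np.abs(depth_pred - depth_gt)))
    return total
```

Because the depth term is additive and optional, dropping it at inference time (or for scenes without depth ground truth) leaves the rest of the objective unchanged, matching the article's note that depth supervision is not required at inference.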
Comprehensive experimental insights and analyses
For their detection branch, the authors largely follow the image-to-voxel projection pipeline of ImVoxelNet, including its backbones, detection heads, resolutions, and training strategies. Their implementation is built on the MMDetection3D platform, marking the first integration of NeRF within MMDetection3D. Additionally, the authors are the first to apply NeRF-style novel depth estimation and view synthesis to the complete ScanNet dataset, a departure from prior works limited to a small subset of scenes.
NeRF-Det's performance is rigorously evaluated for indoor 3D object detection. It is compared with point-cloud and RGB-D-based methods as well as the RGB-only method ImVoxelNet on the ScanNet dataset. With residual network 50 (ResNet50) as the image backbone, NeRF-Det-R50-1x surpasses ImVoxelNet-R50-1x by 2.0 mean average precision (mAP), and NeRF-Det-R50-1x* with depth supervision further enhances detection performance by 0.6 mAP. Extending training to 2x iterations, NeRF-Det-R50-2x achieves 52.0 mAP, outperforming ImVoxelNet-R50-2x by 3.6 mAP. When ResNet50 is replaced with ResNet101, NeRF-Det-R101-2x attains 52.9 mAP at intersection over union (IoU) threshold 0.25, surpassing ImVoxelNet. These results highlight the effectiveness of NeRF-Det, especially when depth supervision is incorporated.
Qualitatively, NeRF-Det demonstrates precise detection even in densely populated scenes with varying object scales. Scene geometry modeling methods, including depth maps and cost volumes, are compared, with NeRF-based modeling exhibiting significant improvements. Furthermore, NeRF-Det's joint approach is compared to a NeRF-then-Det method, showcasing its superior performance. The authors also delve into the influence of the detection branch on novel view synthesis and depth estimation, emphasizing the importance of accurate geometry modeling.
In the Ablation Study, various components of NeRF-Det are examined, including shared G-MLP, feature sampling strategies, different losses, and features' impact on performance. The study underscores the critical role of multi-view consistency and variance features in enhancing geometry cues. Additionally, the authors explore how the detection branch affects novel view synthesis, revealing intriguing findings for future research.
Conclusion
In summary, researchers introduced NeRF-Det as a novel approach for 3D detection from posed RGB images. It deeply integrates multi-view geometry constraints from NeRF into 3D detection through a shared geometry MLP. To enhance NeRF-MLP's generalizability, it leverages augmented image features as priors and samples features from high-resolution images. This work underscores NeRF's significance in 3D detection and provides insights into optimizing its performance.