In an article recently submitted to the arXiv* preprint server, researchers proposed DiffusionEngine (DE), a novel data scaling-up engine that can generate detection-oriented, high-quality training pairs in a single stage.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
In recent years, object detection has become prevalent in several vision applications, such as scene understanding and recognition. However, the effectiveness of these object detection-based vision applications primarily depends on high-quality training data comprising images with granular box-level annotations.
The training data is typically obtained by manually annotating a substantial number of images collected from the web, which is an expensive, expert-involved, and time-consuming process. Additionally, real-world images often follow long-tail, out-of-domain, or data-sparse distributions, which increases the difficulty and uncertainty of this conventional data collection process.
Recently, diffusion models have received significant attention for image stylization and generation, with several studies investigating their application to assisting object detection tasks. For instance, X-Paste can copy and paste generated foreground objects into existing images to scale data. Similarly, DALL-E for Detection can generate the background context and foreground objects separately and then apply the copy-paste technique for synthetic image generation.
However, the need for additional expert models to label the generated images in these solutions increases the cost and complexity of the overall data scaling process. Additionally, these methods cannot properly paste the generated objects into the repeatedly reused background images, leading to limited diversity and implausible composite images.
Moreover, the annotation and image generation processes are separated, without fully leveraging the location-aware semantics and detection-aware concepts learned by the diffusion model. These drawbacks of existing solutions have necessitated the development of an effective, scalable, and simple algorithm for detection data scaling.
DiffusionEngine for detection data scaling
In this paper, researchers proposed DE, a novel tool composed of a Detection-Adapter (DA) and a pre-trained diffusion model, for object detection data scaling. The pre-trained diffusion model implicitly learns location-aware semantics and object-level structure, which can be exploited explicitly as the backbone for the object detection task.
Additionally, the DA can be constructed using different detection frameworks to acquire detection-oriented concepts from the frozen diffusion-based backbone and generate precise annotations. DE is versatile and efficient, as designing a DA that produces training pairs in a single stage eliminates the complex multi-stage processes of conventional data scaling. Thus, DE can be combined with object detection models in a plug-and-play manner to improve their performance. DE possesses an exceptional labeling ability, as the DA aligns the implicit knowledge learned by the off-the-shelf diffusion model with task-aware signals. Moreover, DE offers a virtually unlimited scaling capacity and can generate thousands of additional training pairs.
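The single-stage idea can be illustrated with a minimal sketch. The class and function names below (FrozenDiffusionBackbone, DetectionAdapter, scale_up) are illustrative stand-ins, not the authors' API; the point is that every generated image leaves the engine already paired with annotations, with no separate labeling stage.

```python
import random

class FrozenDiffusionBackbone:
    """Stand-in for a pre-trained diffusion model whose intermediate
    features implicitly encode location-aware semantics."""
    def generate(self, prompt, seed):
        random.seed(seed)
        image = f"image<{prompt}#{seed}>"               # placeholder generated image
        features = [random.random() for _ in range(4)]  # placeholder backbone features
        return image, features

class DetectionAdapter:
    """Stand-in for the Detection-Adapter: decodes box-level annotations
    from the frozen diffusion features in the same forward pass."""
    def annotate(self, features):
        # A real adapter would predict boxes and classes from the
        # diffusion features; here we emit one dummy box per feature.
        return [{"bbox": (0, 0, round(f * 100), round(f * 100)), "label": "object"}
                for f in features]

def scale_up(prompts, backbone, adapter, seeds_per_prompt=2):
    """Single-stage scaling: each image arrives with its annotations."""
    pairs = []
    for prompt in prompts:
        for seed in range(seeds_per_prompt):
            image, feats = backbone.generate(prompt, seed)
            pairs.append((image, adapter.annotate(feats)))
    return pairs

pairs = scale_up(["a dog on a sofa"], FrozenDiffusionBackbone(), DetectionAdapter())
print(len(pairs))  # prints 2: two annotated training pairs from one prompt
```

Because the backbone is frozen and the adapter reads its features directly, varying only the prompt and seed scales the dataset without any post-hoc labeling step.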
Researchers created two scaling-up datasets using DE to facilitate future studies on object detection. These datasets scaled up the original annotations and images to provide diverse and scalable data for developing next-generation state-of-the-art (SOTA) detection algorithms.
Researchers used the SOTA detection framework DINO as the DA for the DE in this study. Existing object detection benchmarks were leveraged by researchers for DA learning, where the latent diffusion model (LDM) features were obtained by simulating the last denoising step with real images.
The one-step training procedure offered three major advantages: only image-detection pairs were needed for training, the components and layout of the original image were well preserved after inversion, and existing labeled detection benchmarks could be used directly for DA learning without additional data collection and labeling efforts.
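The training recipe above can be sketched as a toy loop: a real benchmark image is encoded, lightly perturbed as at the final timestep, and "denoised" once to yield features against which the adapter's box predictions are scored. All function names and the loss are simplified placeholders, assuming a single box per image; a real LDM would expose UNet features and a set-based detection loss.

```python
import random

def encode(image):
    """Toy stand-in for the LDM encoder (image -> latent)."""
    return [pixel / 255.0 for pixel in image]

def simulate_last_denoising_step(latent, noise_scale=0.01):
    """Perturb the real-image latent as at the final timestep, then take
    one denoising step; a real LDM would expose its UNet features here."""
    noisy = [z + random.uniform(-noise_scale, noise_scale) for z in latent]
    return [abs(z) for z in noisy]       # placeholder for extracted features

def adapter_loss(pred_box, gt_box):
    """Toy regression loss: mean absolute error over box coordinates."""
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / len(gt_box)

random.seed(0)
real_image = [26, 51, 128, 230]          # pixels of a labeled benchmark image
gt_box = (0.1, 0.2, 0.5, 0.9)            # its existing ground-truth annotation

features = simulate_last_denoising_step(encode(real_image))
pred_box = tuple(features)               # placeholder adapter prediction
loss = adapter_loss(pred_box, gt_box)    # signal used to train only the adapter
```

The key property mirrored here is that the inversion is near-lossless: since only a single, small denoising step is simulated, the latent stays close to the real image, so the benchmark's existing boxes remain valid supervision.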
The learned DA-equipped DE effectively scaled up the data in a single stage, and two scaling-up datasets, Pascal Visual Object Classes (VOC)-DE and Common Objects in Context (COCO)-DE, were constructed. Images from the COCO train2017 split served as references, and their corresponding captions were used as text prompts for COCO-DE. Similarly, images from the Pascal VOC trainval0712 split and a generic text prompt were used for VOC-DE. An image-guided text-to-image generation process was applied to construct both scaling-up datasets.
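The difference in prompt sourcing between the two datasets can be sketched as follows. The record fields (`"image"`, `"caption"`, `"classes"`) and the generic-prompt template are illustrative assumptions, not the authors' data schema: COCO images come with captions to reuse as prompts, while VOC images do not, so a generic prompt is built instead.

```python
def build_prompts(dataset, records):
    """COCO-DE reuses each reference image's caption as the text prompt;
    VOC-DE lacks captions, so a generic prompt is assembled instead
    (here, hypothetically, from the image's class labels)."""
    prompts = []
    for rec in records:
        if dataset == "COCO-DE":
            prompt = rec["caption"]
        else:  # VOC-DE
            prompt = "a photo of " + " and ".join(rec["classes"])
        prompts.append((rec["image"], prompt))
    return prompts

coco = build_prompts("COCO-DE", [{"image": "000001.jpg", "caption": "a dog on a sofa"}])
voc = build_prompts("VOC-DE", [{"image": "2007_000027.jpg", "classes": ["person", "bicycle"]}])
print(coco[0][1])  # prints: a dog on a sofa
print(voc[0][1])   # prints: a photo of person and bicycle
```

Each (reference image, prompt) pair would then feed the image-guided text-to-image generation process, with the DA annotating every output.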
Experimental evaluation of DE
Researchers evaluated the effectiveness of data scaling using DE on the extensively used COCO object detection benchmark. Different detection algorithms, including the anchor-free algorithm DINO, the anchor-based two-stage algorithm Faster-RCNN, and the anchor-based one-stage algorithm RetinaNet, were used during the experiments. They also used the VOC-0712 dataset to experimentally verify the generalization of DE, using Faster-RCNN with a ResNet50 backbone.
Additionally, DE was compared with the SOTA data scaling-up techniques Copy-Paste and DALL-E for Detection, with Faster-RCNN with a ResNet50 backbone used for the experiment on the VOC2012 segmentation set. The robustness of DE in out-of-domain scenarios was assessed through experiments on the Clipart1k dataset, which contains 500 Clipart-domain images.
Significance of the study
Incorporating DE-generated data into RetinaNet and DINO with a ResNet50 backbone led to 3.3% and 3.1% increases in mean average precision (mAP), respectively, on COCO compared to the baseline RetinaNet and DINO algorithms. This indicated the feasibility of combining DE-generated data with various detection algorithms to realize consistent performance gains.
The inclusion of DE also improved the mAP of Faster-RCNN with a ResNet50 backbone and DINO with a Swin-L backbone on COCO by 4.8% and 1.7%, respectively. Moreover, the mAP of Faster-RCNN with a ResNet50 backbone was increased by up to 7.6% on VOC-0712 when DE-generated data was added.
On the VOC2012 segmentation set, the mAP gain for Faster-RCNN was 8.9% with the DALL-E technique and 6.5% with Copy-Paste, whereas adding twice and nine times the original data using DE yielded gains of 8.1% and 20.9%, respectively, indicating the effectiveness of DE in improving object detection performance compared to other SOTA data scaling techniques. Models trained with DE-generated data outperformed models trained on Clipart by 11.5%, demonstrating the efficacy of DE in scaling up cross-domain data.
To summarize, the findings of this study demonstrated that DE is generalizable, diverse, and scalable, and can be used to realize significant performance improvements for object detection models in different settings.
Journal reference:
- Preliminary scientific report.
Zhang, M., Wu, J., Ren, Y., Li, M., Qin, J., Xiao, X., Liu, W., Wang, R., Zheng, M., & Ma, A. J. (2023). DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection. arXiv. https://doi.org/10.48550/arXiv.2309.03893, https://arxiv.org/abs/2309.03893