In an article recently posted to the Meta Research website, researchers explored advancements in text-to-3D models. They introduced IM-3D, a novel approach that leveraged a video-based generator and Gaussian splatting reconstruction to produce high-quality three-dimensional (3D) outputs directly from text. By enhancing multiview generation and reducing the workload placed on the two-dimensional (2D) generator network, IM-3D significantly improved efficiency, quality, and consistency in generating 3D assets.
Related Work
Past work in text-to-3D generation has primarily relied on 2D image generators trained on large image datasets, owing to the scarcity of suitable 3D training data. Approaches like score distillation sampling (SDS) have been used but suffer from slow processing and artifacts. Research has focused on overcoming these limitations by fine-tuning 2D generators and exploring alternatives such as direct 3D reconstruction from generated views. Despite these advancements, however, existing methods still struggle with long processing times, susceptibility to artifacts, and limits on quality and efficiency. Addressing these drawbacks remains a critical area of research in text-to-3D generation.
Multiview Generator Network
The method introduces a video-based multiview generator network that leverages the Emu Video model, fine-tuned to produce high-quality multiview videos from textual prompts. A text-to-image model first generates an initial image, which then guides the generation of up to 16 video frames depicting different views of a 3D object. Unlike previous methods, the model draws samples from a learned conditional distribution, allowing slight deviations from the input image so the output better fits the generated video. Researchers trained the network on an internal dataset of synthetic 3D objects, providing diverse examples with fixed camera parameters and random elevation.
The training dataset consists of turntable-like videos of synthetic 3D objects, ensuring consistent views at fixed angular intervals. The assets come from an in-house collection of high-quality 3D models, filtered by how well they align with their textual descriptions according to CLIP. Each video is generated by randomly choosing a camera elevation and capturing views around the object at uniform intervals.
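To make this camera setup concrete, the following minimal sketch shows one way such turntable trajectories could be sampled. The radius, elevation range, and random-seed handling are illustrative assumptions; only the idea of one random elevation per clip with uniformly spaced azimuths comes from the description above, with 16 views matching the frame count mentioned earlier.

```python
import numpy as np

def turntable_cameras(num_views=16, radius=2.0, elev_range=(-10.0, 40.0), seed=0):
    """Camera centers for one turntable-style clip: a single random elevation,
    with azimuths spaced at uniform angular intervals around the object."""
    rng = np.random.default_rng(seed)
    elevation = np.deg2rad(rng.uniform(*elev_range))                 # one elevation per clip
    azimuths = np.deg2rad(np.linspace(0.0, 360.0, num_views, endpoint=False))

    # Spherical-to-Cartesian conversion; every camera looks at the origin.
    x = radius * np.cos(elevation) * np.cos(azimuths)
    y = radius * np.cos(elevation) * np.sin(azimuths)
    z = np.full(num_views, radius * np.sin(elevation))
    return np.stack([x, y, z], axis=-1)                              # (num_views, 3)

print(turntable_cameras().shape)  # -> (16, 3)
```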
To generate a 3D asset from a textual prompt, images and videos are sampled from the Emu models, and a 3D model is fitted to them directly using Gaussian splatting. This representation efficiently approximates the 3D opacity and color functions, allowing fast rendering of high-resolution images at every training iteration. Image-based losses such as learned perceptual image patch similarity (LPIPS) and the multi-scale structural similarity index measure (MS-SSIM) are employed for robust reconstruction, so the 3D model is optimized directly against the generated frames.
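As a rough illustration of such an image-level objective, the sketch below combines LPIPS and MS-SSIM using the publicly available lpips and pytorch-msssim packages. The loss weights, the 256x256 resolution, and the VGG backbone are assumptions for the example, not the paper's exact configuration.

```python
import torch
import lpips                         # pip install lpips
from pytorch_msssim import ms_ssim   # pip install pytorch-msssim

lpips_fn = lpips.LPIPS(net="vgg")    # perceptual distance on VGG features

def image_reconstruction_loss(rendered, target, w_lpips=1.0, w_msssim=1.0):
    """Image-level loss between rendered views and generated frames.
    Both tensors have shape (N, 3, H, W) with values in [0, 1]."""
    # LPIPS expects inputs scaled to [-1, 1].
    loss_lpips = lpips_fn(rendered * 2 - 1, target * 2 - 1).mean()
    # MS-SSIM is a similarity in [0, 1]; use (1 - similarity) as a loss.
    loss_msssim = 1.0 - ms_ssim(rendered, target, data_range=1.0)
    return w_lpips * loss_lpips + w_msssim * loss_msssim

# Example: 16 rendered views compared against the 16 generated frames.
rendered = torch.rand(16, 3, 256, 256, requires_grad=True)
frames = torch.rand(16, 3, 256, 256)
image_reconstruction_loss(rendered, frames).backward()
```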
Unlike traditional methods that rely on SDS for multiview consistency, this approach requires no SDS loss optimization. Instead, an iterative process alternates between 3D reconstruction and video generation to compensate for residual inconsistencies, significantly reducing the number of generator evaluations compared to SDS while keeping the generation of high-quality 3D assets from textual prompts efficient and robust.
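The schematic sketch below shows one way such an alternation could be organized. The callables generate_views, fit_gaussians, and render, the number of refinement rounds, and the use of the current renders to re-initialize generation are illustrative assumptions rather than the authors' exact procedure.

```python
def iterative_refinement(prompt, image, generate_views, fit_gaussians, render,
                         num_rounds=2):
    """Alternate between multiview video generation and 3D reconstruction.

    generate_views(prompt, image, init_frames) -> list of view images
    fit_gaussians(frames)                      -> Gaussian-splat 3D model
    render(model)                              -> list of re-rendered views

    All three callables are placeholders for the multiview generator, the
    splatting-based reconstruction, and the renderer, respectively.
    """
    frames = generate_views(prompt, image, init_frames=None)
    model = fit_gaussians(frames)
    for _ in range(num_rounds):
        # Re-generate views starting from renders of the current 3D model,
        # then reconstruct again from the cleaner, more consistent frames.
        frames = generate_views(prompt, image, init_frames=render(model))
        model = fit_gaussians(frames)
    return model
```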
Experimental Evaluation Summary
The experiments involve generating 3D objects based on textual descriptions and reference images, using prompts commonly adopted for evaluation, and the method compares favorably with existing approaches. Previous methods typically either synthesize multiview image sequences or output 3D models directly. Quality and artifacts are first assessed visually by comparing the directly synthesized image sequence J with the views Ĵ rendered from the reconstructed 3D model: J may exhibit better quality and faithfulness, whereas Ĵ benefits from multiview consistency by construction.
Researchers compared IM-3D against state-of-the-art methods such as MVDream and Zero123XL. Quantitative comparisons based on contrastive language-image pretraining (CLIP) similarity scores demonstrate IM-3D's superior textual and visual faithfulness. Its strength in visual faithfulness is particularly noteworthy: it achieves scores comparable to the input image I itself while requiring significantly less processing time than competing methods.
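For reference, a CLIP similarity score of the kind used in such evaluations can be computed as in the hedged sketch below. The specific checkpoint (ViT-L/14) and the plain cosine-similarity formulation are assumptions, since the exact evaluation setup is not detailed here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any public CLIP checkpoint works for illustration; ViT-L/14 is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_similarity(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of a rendered view and the prompt."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

# Hypothetical usage: score = clip_similarity(Image.open("render.png"), "a corgi wearing a top hat")
```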
Additionally, researchers conduct a human study to evaluate IM-3D against competitors regarding image alignment and 3D quality. Human raters consistently prefer IM-3D over other methods, indicating its ability to produce high-quality 3D results aligned with the image prompt.
Ablation studies assess the effectiveness of iterative refinement and the impact of image-level loss functions. The results highlight how iterative refinement enhances detail and how important image-level losses are for generating high-quality 3D assets. Additionally, comparisons of different 3D representations show that Gaussian splatting outperforms neural radiance fields (NeRF) in visual quality while requiring fewer computational resources.
Despite its effectiveness, the fine-tuned video generator in IM-3D still has limitations, particularly with highly dynamic subjects. Spurious animations may occur, especially when the prompt involves verbs describing motion, which can pose challenges for 3D reconstruction.
Conclusion
To sum up, the introduction of IM-3D represented a significant advancement in text-to-3D models. IM-3D offered a novel approach for directly generating high-quality 3D outputs from text prompts through a video-based generator and Gaussian splatting reconstruction. By improving efficiency, quality, and consistency in the generation process, IM-3D presented a promising solution for applications requiring 3D asset generation from textual descriptions. Its demonstrated superiority in both quantitative and qualitative evaluations underscored its potential for practical use. Continued research and development in this direction promised to further enhance the capabilities and applicability of text-to-3D models like IM-3D.