In an article recently submitted to the arXiv* preprint server, researchers proposed Ultraman, a new framework for three-dimensional (3D) human reconstruction that recovers ultra-detailed results at high speed from a single front-view image.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Existing limitations for reconstruction
3D reconstruction of the human body has been a persistent problem in both graphics and computer vision. Image-based human body reconstruction, which recovers a person's 3D texture and shape from images, is a fundamental component of online social networking, virtual reality, and digital entertainment applications.
Reconstructing a 3D human from only a single image is a major technical challenge, as a single image cannot capture the full appearance of the body. Inferring the appearance and geometry of the person's unseen parts is therefore crucial, which requires incorporating 3D human priors into the reconstruction technique.
Conventional methods address this by introducing parametric human shape models such as SCAPE and SMPL. However, these methods focus solely on reconstructing body shape without considering appearance, and the limitations of these models make it difficult to accurately represent loose and complex clothing.
Recent studies have improved results in such cases by integrating normal or depth estimation into 3D human reconstruction, yielding more reliable shape estimates; even so, the reconstructed appearance often lacks detail or receives implausible textures.
The proposed Ultraman approach
In this study, researchers proposed Ultraman, a novel single-image 3D human reconstruction framework that produces fully textured 3D human models, recovering both high-quality geometry and appearance from a single image.
A depth estimation-based method was used to extract the 3D human shape from the single input image, and the estimation results were then refined with post-processing techniques such as mesh simplification. The objective was to reconstruct a textured 3D human from a single front-view RGB image, an input that suits many applications because it is easy to acquire and use.
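The article does not give the implementation details of this step. Purely as an illustration of the general recipe (monocular depth estimation, meshing, then simplification), the following Python sketch uses Open3D; the depth estimator, camera intrinsics, and triangle budget are assumptions, not the authors' actual choices.

# Illustrative sketch only: depth-based shape extraction followed by mesh
# simplification. The depth estimator, intrinsics, and triangle budget are
# assumptions, not the authors' implementation.
import numpy as np
import open3d as o3d

def reconstruct_front_view(rgb_path, estimate_depth):
    # estimate_depth is a hypothetical monocular depth estimator returning
    # an HxW float32 depth map aligned with the input image (in metres).
    color = o3d.io.read_image(rgb_path)
    depth = o3d.geometry.Image(estimate_depth(rgb_path).astype(np.float32))

    # Assumed pinhole intrinsics for a 512x512 input.
    intrinsic = o3d.camera.PinholeCameraIntrinsic(512, 512, 500.0, 500.0, 256.0, 256.0)
    rgbd = o3d.geometry.RGBDImage.create_from_color_and_depth(
        color, depth, depth_scale=1.0, convert_rgb_to_intensity=False)
    pcd = o3d.geometry.PointCloud.create_from_rgbd_image(rgbd, intrinsic)

    # Mesh the points, then simplify as a post-processing step.
    pcd.estimate_normals()
    mesh, _ = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=9)
    return mesh.simplify_quadric_decimation(target_number_of_triangles=50_000)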
Ultraman significantly improves reconstruction accuracy and speed compared to existing techniques while preserving high-quality texture details. It comprises three key modules: a mesh reconstruction module, a multi-view image generation module, and a texturing module.
The framework reconstructed a high-quality body mesh from a single image and completed the invisible parts using the multi-view image generation module and a texturing strategy. The mesh reconstruction module generated the human mesh and corresponding 3D UV maps from the front view, the multi-view image generation module synthesized images from unobserved views, and the texturing module added texture to the body mesh.
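Read as a pipeline, the three modules compose roughly as follows; the class interfaces and method names below are hypothetical placeholders standing in for the components described above, not code from the paper.

# Hypothetical composition of the three Ultraman modules; names and
# signatures are illustrative assumptions only.
def ultraman_pipeline(front_image, mesh_module, view_generator, texturer,
                      viewpoints=("back", "left", "right")):
    # 1. Mesh reconstruction: front-view image -> body mesh + UV map.
    mesh, uv_map = mesh_module.reconstruct(front_image)

    # 2. Multi-view generation and 3. texturing, one unobserved view at a time.
    for view in viewpoints:
        generated, mask = view_generator.render_view(front_image, mesh, view)
        texturer.apply(mesh, uv_map, generated, mask, view)

    # Smooth the gaps between the regions painted by different views.
    texturer.blend_seams(uv_map)
    return mesh, uv_map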
Ultraman methodology
Initially, the input image was fed to the mesh reconstruction module for mesh reconstruction and UV map export. Concurrently, GPT-4V was used to answer questions about the individual in the input image, producing a more thorough description and enabling accurate prompt generation.
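The exact questions and prompt template are not reproduced in the article. A minimal sketch of this step, assuming the OpenAI GPT-4V chat API and an illustrative question list, might look like the following.

# Sketch of querying a vision-language model (GPT-4V) about the subject and
# concatenating the answers into a text prompt. The question list, prompt
# template, and model name are assumptions, not the paper's own.
import base64
from openai import OpenAI

QUESTIONS = [
    "What is the person wearing on their upper body?",
    "What is the person wearing on their lower body?",
    "What are the person's hairstyle and hair color?",
]

def build_prompt(image_path):
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    answers = []
    for question in QUESTIONS:
        resp = client.chat.completions.create(
            model="gpt-4-vision-preview",  # model name may differ
            max_tokens=60,
            messages=[{"role": "user", "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ]}],
        )
        answers.append(resp.choices[0].message.content.strip())
    return "a photo of a person, " + ", ".join(answers)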
The generated prompt was then fed to the multi-view image generation module. This module consists of a redesigned control model containing an IP-Adapter and a ControlNet. Given the prompt for the current viewpoint, the control model guided texture generation for that viewpoint using the input image and a depth map rendered from the mesh.
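The internals of the redesigned control model are not covered in the article. As a rough approximation of the mechanism, a depth-conditioned ControlNet combined with an IP-Adapter that injects the input photo, an off-the-shelf Diffusers sketch could look like this; the checkpoints and scales are assumptions, not the authors' model.

# Approximate, off-the-shelf stand-in for the control model: a depth
# ControlNet steers geometry-consistent generation while an IP-Adapter
# injects the appearance of the input photo. Checkpoints and scales are
# assumptions, not the redesigned model used in the paper.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models",
                     weight_name="ip-adapter_sd15.bin")
pipe.set_ip_adapter_scale(0.6)

def generate_view(prompt, depth_map, front_image):
    # depth_map: PIL image rendered from the mesh at the target viewpoint;
    # front_image: the original input photo used as the IP-Adapter reference.
    return pipe(prompt, image=depth_map, ip_adapter_image=front_image,
                num_inference_steps=30).images[0]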
The texture image for the current viewpoint, along with the corresponding generation mask, was passed to the texturing module, which added the texture to the body mesh. Finally, the seams between the different generation mask regions were identified and smoothed to produce the output.
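How the seams are detected and smoothed is not specified in the article. One simple way to do it, sketched below under the assumption that each viewpoint leaves a binary mask of the texels it painted, is to blur the texture only in a narrow band around the mask boundaries.

# Simple illustrative seam smoothing between texture regions painted from
# different viewpoints; band width and blur size are arbitrary choices.
import cv2
import numpy as np

def smooth_seams(texture, region_masks, band=5, blur=9):
    # texture: HxWx3 uint8 UV texture; region_masks: list of HxW uint8 masks
    # (0/255), one per viewpoint, marking the texels that view painted.
    kernel = np.ones((band, band), np.uint8)
    seam = np.zeros(texture.shape[:2], np.uint8)
    for mask in region_masks:
        # Texels near a region border belong to a seam.
        border = cv2.dilate(mask, kernel) - cv2.erode(mask, kernel)
        seam = cv2.bitwise_or(seam, border)
    blurred = cv2.GaussianBlur(texture, (blur, blur), 0)
    out = texture.copy()
    out[seam > 0] = blurred[seam > 0]  # replace only the seam texels
    return out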
Significance of the work
The researchers performed extensive experiments to evaluate Ultraman's performance on several standard datasets. Ultraman demonstrated superior performance across these datasets, outperforming existing state-of-the-art methods in both human rendering quality and speed.
Ultraman produced good reconstructions across different genders, standing postures, and clothing styles, faithfully reproducing details such as holes in pants, watches, and crossed hands. It outperformed existing state-of-the-art single-image human reconstruction methods, including PaMIR, PIFu, and TeCH, in generating the texture of the human back.
Additionally, the proposed framework clearly distinguished character features geometrically. Ultraman also delivered higher-quality reconstructions than PaMIR and PIFu on loose-fitting garments, with better texture and geometry than both of these existing methods.
Moreover, Ultraman generated a textured human mesh in 20-30 minutes, whereas existing methods such as TeCH took 4-5 hours to produce a comparable result. The proposed method thus improved inference speed by roughly 93% over the current state of the art. In a user study in which participants were asked to select the best model among those produced by Ultraman, TeCH, PaMIR, and PIFu, users rated the Ultraman results as the best 90.5% of the time.
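As a back-of-the-envelope check, the reported figure is consistent with the stated run times; the specific values below (20 minutes for Ultraman against a 4.75-hour TeCH run) are assumptions chosen purely for illustration.

# Rough consistency check of the ~93% speed-improvement figure; the exact
# times used by the authors are not stated, so these values are assumptions.
ultraman_minutes = 20
tech_minutes = 4.75 * 60          # 285 minutes
reduction = 1 - ultraman_minutes / tech_minutes
print(f"{reduction:.1%}")         # -> 93.0%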
To summarize, this study's findings demonstrated the feasibility of using Ultraman in different downstream applications, including virtual reality and digital entertainment.
Journal reference:
- Preliminary scientific report.
Chen, M., Chen, J., Ye, X., Gao, H., Chen, X., Fan, Z., & Zhao, H. (2024). Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail. arXiv. https://doi.org/10.48550/arXiv.2403.12028, https://arxiv.org/abs/2403.12028