In an article recently submitted to the arXiv* server, researchers introduced a novel approach called TADA (Text to Animatable Digital Avatars), which generates expressive 3D avatars from textual descriptions. TADA combines a 2D diffusion model with a parametric body model to produce high-quality geometry and lifelike textures.
Unlike existing methods, TADA ensures alignment between geometry and texture, enabling realistic animation with semantic consistency. The approach employs an upsampled SMPL-X body model with a displacement layer and a texture map, along with hierarchical rendering and score distillation sampling (SDS). Both qualitative and quantitative evaluations highlight TADA's superiority in creating detailed, realistic digital characters for text-guided animation and rendering.
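To give a sense of the hierarchical-rendering idea, the sketch below alternates between full-body viewpoints and close-up viewpoints (for example, on the head) during optimization, so the diffusion guidance supervises both global shape and fine detail. The camera parameterization, probabilities, and field-of-view values are illustrative assumptions, not the authors' settings.

```python
import random

def sample_camera():
    """Choose a full-body or close-up viewpoint for the next rendering pass."""
    azimuth = random.uniform(0.0, 360.0)          # random viewing direction (degrees)
    if random.random() < 0.3:                     # roughly a third of iterations zoom in on the head
        return {"azimuth": azimuth, "distance": 0.4, "look_at": "head", "fov": 20.0}
    return {"azimuth": azimuth, "distance": 1.8, "look_at": "body", "fov": 50.0}
```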
Related work
Recent studies have extended text-to-2D-image generation techniques to the emerging field of text-to-3D content creation. Focusing specifically on text-to-3D avatar generation, a range of methods and associated challenges come to light. Some utilize CLIP-based optimization to address shape and texture issues but struggle to produce lifelike 2D renderings. Others leverage SDS from 2D models to optimize 3D representations, bridging the gap left by the scarcity of 3D training data.
Methods such as TEXTure and DreamFusion optimize textures and Neural Radiance Fields, respectively, yet suffer from slow optimization and low-resolution outputs. Magic3D adopts a dual-stage approach, while Fantasia3D disentangles geometry and texture, but neither produces immediately animatable results. In text-to-3D avatar generation, methods such as AvatarCLIP, DreamAvatar, and DreamHuman face hurdles related to geometry and appearance quality, compatibility, and animation capability.
Proposed method
The primary goal of TADA is to create high-quality, full-body avatars that can be animated, all driven by textual prompts. The process begins with the initialization of a 3D avatar using an upsampled SMPL-X model, which is defined by shape, pose, and expression parameters. To enhance the level of detail, learnable displacements are incorporated, yielding a highly detailed "clothed" avatar. A crucial aspect of TADA's approach is ensuring harmony between the avatar's geometry and its texture. This is achieved through SDS losses that operate on both normal images and RGB images in the latent space. By incorporating both types of images, TADA ensures that the generated avatars possess both coherent geometry and lifelike textures.
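As a rough illustration of how such a combined objective can be set up, the sketch below applies the standard SDS gradient to latent encodings of both the RGB and the normal rendering and blends the two. The renderer outputs, latent encoder, and noise predictor (`encode`, `predict_noise`) are hypothetical placeholders, not the authors' implementation.

```python
import torch

def sds_grad(latents, t, noise, predict_noise, alphas_cumprod):
    """Standard SDS gradient w(t) * (eps_hat - eps); no backprop through the diffusion model."""
    a_t = alphas_cumprod.to(latents.device)[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise      # forward-diffuse the latents
    eps_hat = predict_noise(noisy, t)                            # text-conditioned noise prediction
    return (1 - a_t) * (eps_hat - noise)

def combined_sds_loss(rgb, normal, encode, predict_noise, alphas_cumprod, lam=0.5):
    """Blend SDS guidance from the RGB rendering and the normal-map rendering."""
    z_rgb, z_nrm = encode(rgb), encode(normal)                   # encode renderings into latent space
    t = torch.randint(20, len(alphas_cumprod), (z_rgb.shape[0],), device=z_rgb.device)
    noise = torch.randn_like(z_rgb)
    g_rgb = sds_grad(z_rgb, t, noise, predict_noise, alphas_cumprod).detach()
    g_nrm = sds_grad(z_nrm, t, noise, predict_noise, alphas_cumprod).detach()
    # Surrogate loss whose gradient w.r.t. the avatar parameters equals the blended SDS gradient.
    return (z_rgb * g_rgb).sum() + lam * (z_nrm * g_nrm).sum()
```

Back-propagating this surrogate loss through the encoder and the differentiable renderer updates the avatar's displacements and texture without ever differentiating through the diffusion model itself.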
Furthermore, TADA emphasizes maintaining semantic consistency with the SMPL-X model, which is particularly important for animation. During training, the method introduces various gestures and expressions, ensuring that the resulting avatars can be animated using the pose and expression spaces provided by SMPL-X and exhibit natural, coherent movements.
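A minimal sketch of this animation-aware training loop is shown below: each iteration draws random body poses and facial expressions from the SMPL-X parameter spaces, poses the avatar, renders it, and applies the guidance loss. The `avatar` object, its `pose` and `render` methods, and the sampling ranges are illustrative assumptions rather than the paper's actual code.

```python
import torch

def sample_animation_params(batch_size=1, device="cpu"):
    """Draw random SMPL-X body poses, jaw poses, and expression coefficients."""
    body_pose  = 0.3 * torch.randn(batch_size, 21, 3, device=device)   # axis-angle per body joint
    jaw_pose   = 0.2 * torch.rand(batch_size, 3, device=device)        # opens/closes the mouth
    expression = torch.randn(batch_size, 10, device=device)            # expression PCA coefficients
    return body_pose, jaw_pose, expression

def training_step(avatar, optimizer, camera, guidance_loss):
    """One iteration: animate the avatar with a random pose/expression, render, and optimize."""
    body_pose, jaw_pose, expression = sample_animation_params(device=avatar.device)
    mesh  = avatar.pose(body_pose=body_pose, jaw_pose=jaw_pose, expression=expression)
    image = avatar.render(mesh, camera)            # differentiable rendering of the posed avatar
    loss  = guidance_loss(image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```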
The technical foundation of TADA includes the adoption of the SMPL-X+D representation, a versatile tool for creating animatable avatars. The integration of a learnable displacement parameter adds a layer of personalization by capturing individualized details. To enhance facial features, TADA employs partial mesh subdivision, which refines the mesh structure while maintaining a uniform vertex distribution and smoother skinning weights. The optimization process is a key step in the workflow: it aligns the avatar's geometry and texture through a combination of SDS losses that blend information from normal and RGB images. By jointly optimizing these aspects, TADA ensures that the avatars possess both a realistic appearance and a coherent structure.
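The SMPL-X+D idea can be summarized in a few lines: the SMPL-X surface produced from shape, pose, and expression parameters is offset by a learnable per-vertex displacement, and a learnable UV texture map supplies appearance. The sketch below assumes a generic `smplx_forward` callable (for example, a forward pass of the `smplx` Python package) and illustrative tensor shapes; it is not the authors' implementation and omits the partial face subdivision and skinning details.

```python
import torch
import torch.nn as nn

class SMPLXPlusD(nn.Module):
    """Body model plus learnable per-vertex displacements and a UV texture (SMPL-X+D sketch)."""

    def __init__(self, smplx_forward, num_vertices, tex_res=1024):
        super().__init__()
        self.smplx_forward = smplx_forward                                # params -> (N, 3) vertices
        self.displacement = nn.Parameter(torch.zeros(num_vertices, 3))    # learnable offsets D
        self.texture = nn.Parameter(torch.rand(1, 3, tex_res, tex_res))   # learnable UV texture map

    def forward(self, betas, body_pose, expression):
        vertices = self.smplx_forward(betas=betas, body_pose=body_pose,
                                      expression=expression)              # base SMPL-X surface
        # In a full system the offsets are added in canonical space before skinning;
        # adding them to the output surface keeps this sketch simple.
        return vertices + self.displacement, self.texture
```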
Experimental analysis
The effectiveness of TADA is rigorously evaluated through a combination of qualitative and quantitative comparisons with existing methods. In both full-body and head avatar generation, TADA stands out by producing avatars with realistic textures, diverse body shapes, and seamless alignment between geometry and texture. The method's superiority is validated through a user study, which affirms its excellence in geometry quality, texture fidelity, and alignment with input descriptions. Ablation studies investigate the contributions of the geometry consistency loss and the animation training, shedding light on their roles in the method's performance.
The practical applications of TADA are vast and impactful. It finds utility in virtual try-on scenarios, enabling avatars tailored to individual fashion preferences. Texture editing becomes intuitive, facilitating rapid design modifications, and users can manipulate specific parts of avatars seamlessly. Despite its strengths, the method faces challenges such as relighting discrepancies across different environments and potential biases in character generation. As TADA evolves, ethical considerations take center stage: addressing concerns such as deepfakes, intellectual property rights, gender diversity, and cultural inclusivity becomes integral to its responsible advancement.
Conclusion
To sum up, the paper presents TADA, a method that generates high-quality, animatable 3D avatars exclusively from textual descriptions. The method covers a diverse range of individuals, including celebrities and custom characters, and integrates seamlessly into various industries. The approach uses a subdivided version of SMPL-X with learned displacements and a UV texture, hierarchical optimization with adaptive focal lengths, a geometric consistency loss for geometry-texture alignment, and animation training for semantic correspondence with SMPL-X. Ablation studies and comprehensive results highlight TADA's superiority over existing methods in both qualitative and quantitative terms.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.