Make-An-Animation: Generating Human Motions from Textual Descriptions

In an article recently submitted to the arXiv* preprint server, researchers presented "Make-An-Animation," a model that generates lifelike human motions from textual descriptions.

Study: Make-An-Animation: Generating Human Motions from Textual Descriptions. Image credit: Frame Stock Footage/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

The study highlights applications of text-guided human motion generation in fields such as robotics. Make-An-Animation generates human body models and poses that match the entered text prompt and produces better results than previous methods.

The model is trained in two stages: in the first stage, it is pre-trained on a large set of text and pseudo-pose pairs curated from image-text datasets; in the second stage, it is fine-tuned on motion capture data, with additional layers added to model the temporal dimension. Make-An-Animation adopts a U-Net architecture similar to those used in text-to-video generation models, which sets it apart from other motion generation models.
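The article does not include implementation code, but the temporal-adaptation idea can be illustrated with a short, hypothetical PyTorch sketch: a block pre-trained on static poses is wrapped with a zero-initialized temporal convolution, so fine-tuning on motion data starts from the static-pose behavior. The layer placement, dimensions, and zero-initialization choice here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Wraps a block pre-trained on static poses and adds a temporal 1D convolution.

    The temporal convolution is zero-initialized, so at the start of motion
    fine-tuning the module behaves exactly like the static-pose block.
    """
    def __init__(self, spatial_block: nn.Module, channels: int):
        super().__init__()
        self.spatial_block = spatial_block          # trained in stage 1 on text/pseudo-pose pairs
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.zeros_(self.temporal_conv.weight)   # residual branch starts as identity
        nn.init.zeros_(self.temporal_conv.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels); each frame goes through the spatial block independently
        b, t, c = x.shape
        h = self.spatial_block(x.reshape(b * t, c)).reshape(b, t, c)
        # mix information across frames and add it as a residual
        return h + self.temporal_conv(h.transpose(1, 2)).transpose(1, 2)

# Toy usage: a linear layer stands in for one pre-trained U-Net stage
block = TemporalAdapter(nn.Linear(256, 256), channels=256)
out = block(torch.randn(2, 16, 256))   # two sequences of 16 frames each
```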

Background

Traditional approaches to generating human-like motions from text descriptions have faced limitations, primarily due to the reliance on motion capture data, which is often expensive and limited in scope. These limitations result in suboptimal performance when dealing with real-world text prompts, as there are numerous ways to describe a particular movement, making it difficult to accurately match textual descriptions with specific poses. 

Central to the approach is the idea that training data for poses and movements should not be limited to motion capture. Instead, large-scale "in-the-wild" image-text datasets are needed, which significantly improve results. The term "in the wild" refers to movements and activities people perform in everyday settings; taken together, such images cover a far wider variety of human poses.

About the Study

The "Make-An-Animation" model, developed by Meta AI, offers a groundbreaking solution to these challenges. Unlike traditional methods, which heavily depend on motion capture data, this model leverages large-scale image-text datasets containing a staggering 35 million pairs of textual descriptions and static poses. This vast dataset allows the model to learn and generate static 3D body poses. 

The researchers used two different datasets for training the model:

3D human motion datasets: The model uses the AMASS dataset of 3D motions together with textual annotations from the HumanML3D dataset. The researchers chose the Skinned Multi-Person Linear Model (SMPL) annotations from AMASS rather than the pose representation computed in HumanML3D, because converting that representation back to SMPL would require extra optimization and monitoring and could degrade quality. They also doubled the dataset by mirroring the motions and adjusting the textual descriptions accordingly, resulting in 26,850 examples. Furthermore, they used the GTA-Human dataset, which provides SMPL annotations for 20,000 samples from an action game; although it contains no textual annotations, it was used for unconditional training alongside conditional training.

Large-scale Text Pseudo-Pose (TPP) dataset: This dataset consists of 35 million pairs of human poses and related text descriptions, drawn from several large-scale image-text datasets. The researchers used a Detectron2 keypoint detector to select images containing a single person and then extracted 3D pseudo-pose SMPL annotations with a pre-trained PyMAF-X model (a minimal sketch of this pipeline follows below). The advantage of large-scale data is that it mitigates the limitations of existing datasets by providing varied human poses and a vast number of text-pose pairs.
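As a rough illustration of this data-collection step, the sketch below filters image-text pairs down to single-person images with a Detectron2 keypoint model and then hands each surviving image to a pseudo-pose extractor. The Detectron2 configuration and score threshold are generic choices, and `extract_smpl_with_pymafx` is a hypothetical placeholder standing in for PyMAF-X, whose real interface is not reproduced here.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

# Generic COCO keypoint R-CNN; the paper's exact detector settings are not specified here.
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.9   # keep only confident person detections
keypoint_detector = DefaultPredictor(cfg)

def extract_smpl_with_pymafx(image):
    """Hypothetical wrapper: run PyMAF-X on the image and return SMPL pose parameters."""
    raise NotImplementedError  # placeholder; PyMAF-X's actual API is not shown here

def build_text_pseudo_pose_pair(image_path: str, caption: str):
    """Keep only images with exactly one detected person, then attach a pseudo-pose."""
    image = cv2.imread(image_path)                        # BGR image, as Detectron2 expects
    instances = keypoint_detector(image)["instances"]
    if len(instances) != 1:                               # discard empty or crowded images
        return None
    return {"text": caption, "pseudo_pose": extract_smpl_with_pymafx(image)}
```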

Results/Discussions

The researchers compared their motion generation model's performance against prevailing generation models.

Automatic Metrics: An automatic evaluation was conducted on the HumanML3D test set using three metrics: Fréchet Inception Distance (FID), R-Precision, and Diversity. FID measures the quality and diversity of generated samples against a ground-truth set, R-Precision assesses how faithfully generated motions match their input prompts, and the Diversity score measures variation across generated motions.
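The evaluation code is not given in the article, but FID and Diversity follow standard formulas over motion feature embeddings. The sketch below assumes such embeddings are already available as NumPy arrays; the feature extractor itself is not shown.

```python
import numpy as np
from scipy import linalg

def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID between real and generated motion features (one row per sample)."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):                 # drop tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def diversity(gen_feats: np.ndarray, num_pairs: int = 300) -> float:
    """Average distance between randomly chosen pairs of generated motions."""
    rng = np.random.default_rng(0)
    a = gen_feats[rng.integers(len(gen_feats), size=num_pairs)]
    b = gen_feats[rng.integers(len(gen_feats), size=num_pairs)]
    return float(np.linalg.norm(a - b, axis=1).mean())
```

R-Precision, by contrast, is typically computed by retrieval: checking how often the ground-truth caption ranks among the top matches for a generated motion within a batch of candidates.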

Human Evaluation: The model's generalization ability in synthesizing challenging human poses and motions was also judged by people. The authors collected 400 prompts describing actions and scene context through Amazon Mechanical Turk, filtering out inappropriate content, and used these prompts to compare multiple models' ability to generate animated poses.

Ablation Studies: These include experiments examining the impact of the U-Net architecture and the TPP dataset on the model's results. The researchers implemented an alternative diffusion model as a decoder-only transformer, trained on tokenized captions and other inputs to generate 3D pose representations from text; this model was likewise pre-trained on the TPP dataset and then adapted for motion synthesis.
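The following is a rough, hypothetical sketch of what a decoder-only transformer denoiser of this kind might look like: caption tokens, a diffusion-timestep token, and a noisy pose token form one causally masked sequence, and the denoised pose is read off the final position. All dimensions, the vocabulary size, and the conditioning scheme are illustrative assumptions rather than the ablation's actual implementation.

```python
import torch
import torch.nn as nn

class TransformerPoseDenoiser(nn.Module):
    """Decoder-only transformer that denoises a pose vector conditioned on a caption."""
    def __init__(self, vocab_size=10000, pose_dim=159, dim=512, layers=6, heads=8, max_len=256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.pose_in = nn.Linear(pose_dim, dim)
        self.time_embed = nn.Embedding(1000, dim)                  # one embedding per diffusion step
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))      # learned positional embeddings
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.pose_out = nn.Linear(dim, pose_dim)

    def forward(self, caption_tokens, noisy_pose, timestep):
        # Sequence layout: [caption tokens ..., timestep token, noisy pose token]
        seq = torch.cat([self.text_embed(caption_tokens),
                         self.time_embed(timestep).unsqueeze(1),
                         self.pose_in(noisy_pose).unsqueeze(1)], dim=1)
        seq = seq + self.pos[:, :seq.shape[1]]
        causal = torch.triu(torch.ones(seq.shape[1], seq.shape[1], dtype=torch.bool), diagonal=1)
        h = self.backbone(seq, mask=causal)
        return self.pose_out(h[:, -1])                             # denoised pose from the last position

denoiser = TransformerPoseDenoiser()
pred = denoiser(torch.randint(0, 10000, (2, 20)),    # tokenized captions
                torch.randn(2, 159),                 # noisy pose vectors (dimension is illustrative)
                torch.randint(0, 1000, (2,)))        # diffusion timesteps
```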

Qualitative Studies: These evaluate the model's ability to generalize to novel motions by performing, for each synthesized motion, a nearest-neighbor search over the training data for the most similar prompts and motions. The researchers used motion and text embeddings from a pre-trained motion CLIP model. The results show that the model successfully generalizes to motions not present in the training data, and they demonstrate the diversity of motions it generates for various text prompts.
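A nearest-neighbor novelty check of this kind can be sketched as follows, assuming the motion embeddings have already been extracted; plain NumPy and cosine similarity are used here, the motion CLIP encoder itself is not shown, and the function name is illustrative.

```python
import numpy as np

def nearest_training_motion(gen_embedding: np.ndarray, train_embeddings: np.ndarray) -> int:
    """Index of the training motion most similar to a generated one (cosine similarity)."""
    gen = gen_embedding / np.linalg.norm(gen_embedding)
    train = train_embeddings / np.linalg.norm(train_embeddings, axis=1, keepdims=True)
    return int(np.argmax(train @ gen))
```

If the retrieved training motion differs clearly from the generated sample, that is evidence the model has produced something genuinely new rather than memorized an example.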

Limitations, Conclusion, and Future Scope

To summarize, current text-to-motion models use small-scale motion capture datasets, resulting in poor performance on diverse prompts. To address this issue, the authors introduced Make-An-Animation, a human motion generation model trained on a large-scale dataset containing static pseudo-pose and motion capture data. They demonstrated that pre-training on this dataset significantly enhances the model's performance on captions that go beyond the typical motion capture data distribution. Additionally, they introduced a U-Net architecture that facilitates a smooth transition from static pose pre-training to dynamic pose fine-tuning. Overall, this work opens the door to using large-scale image and video datasets to produce 3D human pose parameters.

Furthermore, this technology opens up possibilities in sectors such as cinema, video games, and augmented and virtual reality. In the future, it could significantly advance motion capture technology and transform the animation industry.


Written by

Susha Cheriyedath

Susha is a scientific communication professional holding a Master's degree in Biochemistry, with expertise in Microbiology, Physiology, Biotechnology, and Nutrition. After a two-year tenure as a lecturer from 2000 to 2002, where she mentored undergraduates studying Biochemistry, she transitioned into editorial roles within scientific publishing. She has accumulated nearly two decades of experience in medical communication, assuming diverse roles in research, writing, editing, and editorial management.
