In a paper published in the journal Future Internet, researchers addressed the conventional 3D animation creation process, in which motion acquisition, dubbing, and mouth-movement data binding must be performed for each character. The proposed solution integrates artificial intelligence (AI) with a motion capture system, aiming to streamline animation creation by reducing time, workload, and costs.
Using AI and natural language processing (NLP) allows characters to learn independently and generate responses, a departure from the traditional approach of predefined digital character behaviors. The approach was implemented in the digital person's animation environment using Unity plug-ins, which control the mouth Blendshape, synchronize voice with mouth movements, and connect the digital person to an AI system, enabling AI-driven language interaction in animation production. In their experiments, the researchers evaluated the accuracy of natural language interactions and the real-time synchronization of mouth movements, and they assessed the system's ability to guide users through the digital human animation creation process.
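As a rough illustration of the mouth Blendshape control described above, the following Unity C# sketch drives a "mouth open" blendshape from the loudness of the audio currently playing. The component, blendshape index, and amplitude-to-weight mapping are illustrative assumptions rather than the paper's actual plug-in code.

```csharp
using UnityEngine;

// Sketch: open the mouth blendshape in proportion to the loudness of the
// audio currently playing through the attached AudioSource (e.g. TTS output).
[RequireComponent(typeof(AudioSource))]
public class MouthBlendshapeDriver : MonoBehaviour
{
    public SkinnedMeshRenderer faceRenderer;   // mesh that exposes the mouth blendshape
    public int mouthOpenBlendshapeIndex = 0;   // index of the "mouth open" blendshape (assumed)
    public float sensitivity = 400f;           // scales RMS amplitude to a 0-100 weight

    private AudioSource audioSource;
    private readonly float[] samples = new float[256];

    void Start()
    {
        audioSource = GetComponent<AudioSource>();
    }

    void Update()
    {
        // Read the audio currently being played and estimate its loudness (RMS).
        audioSource.GetOutputData(samples, 0);
        float sum = 0f;
        foreach (float s in samples) sum += s * s;
        float rms = Mathf.Sqrt(sum / samples.Length);

        // Map loudness to a 0-100 blendshape weight so the mouth opens with speech.
        float weight = Mathf.Clamp(rms * sensitivity, 0f, 100f);
        faceRenderer.SetBlendShapeWeight(mouthOpenBlendshapeIndex, weight);
    }
}
```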
Background
AI has revolutionized human-computer interaction through language models such as Chat Generative Pre-trained Transformer (ChatGPT). ChatGPT excels in real-time chat applications, including text generation and code writing. Its capabilities span industries such as healthcare, mathematics, automotive, and education, enhancing productivity and enabling interactive experiences with digital humans. Combining ChatGPT with speech recognition and synthesis streamlines animation creation and improves the interaction between creators and digital characters.
Related work
Previous work on animation creation and character movement laid the groundwork for integrating AI into character interactions. Earlier studies examined the use of automatic speech recognition (ASR) and text-to-speech (TTS) technologies through Application Programming Interfaces (APIs), on local systems, and with custom voice models. The limitations of voice synthesis lay in the rigidity of preset voice models and the complexity of API workflows. Ultimately, the researchers favored ChatGPT's API for voice synthesis.
Proposed method
Once the controls for hand, foot, mouth, and eye movements have been configured, the next phase establishes an environment for seamless data exchange with external APIs. This environment enables the digital human to interact with external APIs and manage data throughout the animation creation process, ensuring a streamlined and efficient workflow.
The implementation uses C# scripts to access the ChatGPT, ASR, and TTS APIs for the digital characters. In Unity, a dedicated object called AI-Turbo manages the dialog interaction with the OpenAI GPT-3.5 Turbo model through API calls, while separate C# scripts handle the ASR and TTS functionality: the ASR script controls voice recording and recognition, and the TTS script uses Microsoft Azure for text-to-speech conversion. These API calls are integrated into the chat interface to provide a seamless interaction experience and enhance the animation creation process in Unity.
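The paper does not reproduce the scripts themselves, but a minimal sketch of such a dialog call to the GPT-3.5 Turbo model from a Unity C# script could look like the following; the class name, hand-built JSON body, and callback are assumptions for illustration, not the AI-Turbo object itself.

```csharp
using System.Collections;
using System.Text;
using UnityEngine;
using UnityEngine.Networking;

// Sketch: send one user message to the gpt-3.5-turbo chat completions endpoint
// and hand the raw JSON reply to a callback.
public class ChatGptClient : MonoBehaviour
{
    private const string Endpoint = "https://api.openai.com/v1/chat/completions";
    [SerializeField] private string apiKey = "YOUR_OPENAI_API_KEY";

    public IEnumerator SendChatMessage(string userText, System.Action<string> onReply)
    {
        // Build a minimal chat-completions request body for gpt-3.5-turbo.
        string body = "{\"model\":\"gpt-3.5-turbo\",\"messages\":[{\"role\":\"user\",\"content\":\""
                      + userText.Replace("\"", "\\\"") + "\"}]}";

        using (UnityWebRequest request = new UnityWebRequest(Endpoint, "POST"))
        {
            request.uploadHandler = new UploadHandlerRaw(Encoding.UTF8.GetBytes(body));
            request.downloadHandler = new DownloadHandlerBuffer();
            request.SetRequestHeader("Content-Type", "application/json");
            request.SetRequestHeader("Authorization", "Bearer " + apiKey);

            yield return request.SendWebRequest();

            if (request.result == UnityWebRequest.Result.Success)
                onReply(request.downloadHandler.text);   // raw JSON reply
            else
                Debug.LogError("ChatGPT request failed: " + request.error);
        }
    }
}
```

Returning the raw JSON keeps the sketch short; in practice the reply text would be parsed from choices[0].message.content before being shown on the UI and forwarded to TTS.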
Experimental analysis
Users can choose between text and voice input when interacting with the digital character. With voice input, the user initiates recording by long-pressing the gray "Recording" button. The recorded voice is then sent to the ASR server, which converts the voice data into text and displays it on the UI. Simultaneously, the recognized text is sent to the OpenAI server to generate a text reply.
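A minimal sketch of the long-press recording step, using Unity's built-in Microphone API, might look like this; the button wiring, ten-second cap, sample rate, and the SendToAsrServer placeholder are illustrative assumptions.

```csharp
using UnityEngine;

// Sketch: record from the default microphone while the "Recording" button is held,
// then hand the captured samples off for ASR.
public class VoiceRecorder : MonoBehaviour
{
    private AudioClip recording;
    private const int MaxSeconds = 10;
    private const int SampleRate = 16000;

    // Hook to the "Recording" button's pointer-down event.
    public void StartRecording()
    {
        recording = Microphone.Start(null, false, MaxSeconds, SampleRate);
    }

    // Hook to the button's pointer-up event.
    public void StopRecording()
    {
        Microphone.End(null);

        // Copy the recorded PCM samples out of the AudioClip.
        float[] samples = new float[recording.samples * recording.channels];
        recording.GetData(samples, 0);
        SendToAsrServer(samples, SampleRate);
    }

    private void SendToAsrServer(float[] samples, int sampleRate)
    {
        // Placeholder: serialize the samples (e.g. to WAV) and POST them to the ASR endpoint.
        Debug.Log($"Captured {samples.Length} samples at {sampleRate} Hz for ASR.");
    }
}
```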
The reply generated by ChatGPT is displayed on the UI, and the text is also sent to the TTS server, which converts it into voice data and synchronizes the digital character's mouth movements with the speech before the result is presented to the user. For text input, the ASR step is skipped and the text is sent directly to the OpenAI server to generate a reply.
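For the TTS step, a sketch along the following lines could be used, assuming the Azure Cognitive Services Speech SDK is imported into the Unity project; the key, region, voice name, and PlayWav helper are illustrative assumptions rather than the paper's script.

```csharp
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using UnityEngine;

// Sketch: synthesize speech for a ChatGPT reply with Azure TTS and keep the
// audio in memory so Unity can drive playback and the mouth blendshapes.
public class AzureTtsClient : MonoBehaviour
{
    [SerializeField] private string subscriptionKey = "YOUR_AZURE_SPEECH_KEY";
    [SerializeField] private string region = "eastus";

    public async Task SpeakAsync(string text)
    {
        var config = SpeechConfig.FromSubscription(subscriptionKey, region);
        config.SpeechSynthesisVoiceName = "en-US-JennyNeural";

        // Passing a null AudioConfig returns the audio data instead of playing it
        // on the default output device.
        using (var synthesizer = new SpeechSynthesizer(config, null as AudioConfig))
        {
            SpeechSynthesisResult result = await synthesizer.SpeakTextAsync(text);
            if (result.Reason == ResultReason.SynthesizingAudioCompleted)
                PlayWav(result.AudioData);   // hypothetical helper, see below
            else
                Debug.LogError("TTS failed: " + result.Reason);
        }
    }

    private void PlayWav(byte[] wavBytes)
    {
        // Placeholder: decode the WAV bytes into an AudioClip, assign it to an
        // AudioSource, and play it; a blendshape driver can then read that output.
        Debug.Log($"Received {wavBytes.Length} bytes of synthesized speech.");
    }
}
```

Keeping the synthesized audio in memory rather than playing it on the default device lets Unity route it through an AudioSource and derive the mouth movements from the same signal.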
Once the environment for the ChatGPT digital character has been set up, tests are needed to evaluate the effectiveness and efficiency of the ASR and TTS systems. The accuracy of the text responses from ChatGPT's API and the success rate of response delivery should be assessed, and the real-time synchronization between the digital character's mouth movements and the corresponding voice also needs testing. Analyzing the resulting experimental data yields meaningful insights into the system's performance.
The testing phase covered the ASR and TTS API connections, the replies from ChatGPT's API, and the real-time synchronization of the digital character's mouth movements with speech, with the overall aim of verifying the proper functioning and performance of the system components. The consistency and reliability of the results still require optimization and improvement, particularly given the influence of network stability and of complex questions in the testing environment.
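As a sketch of how such a response-time check might be scripted, the following coroutine times one round trip through the hypothetical ChatGptClient from the earlier sketch; the logging format and the notion of a failed delivery are assumptions for illustration.

```csharp
using System.Collections;
using System.Diagnostics;
using UnityEngine;
using Debug = UnityEngine.Debug;   // avoid ambiguity with System.Diagnostics.Debug

// Sketch: measure how long one ChatGPT API round trip takes.
public class ApiLatencyTest : MonoBehaviour
{
    public ChatGptClient chatClient;   // hypothetical client from the earlier sketch

    public IEnumerator MeasureReplyLatency(string prompt)
    {
        var stopwatch = Stopwatch.StartNew();
        bool replied = false;

        // Run the request coroutine to completion before checking the outcome.
        yield return StartCoroutine(chatClient.SendChatMessage(prompt, reply =>
        {
            replied = true;
            Debug.Log($"Reply ({reply.Length} chars) in {stopwatch.ElapsedMilliseconds} ms");
        }));

        if (!replied)
            Debug.LogWarning("No reply received; count this attempt as a failed delivery.");
    }
}
```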
Conclusion
To summarize, this paper primarily focused on integrating ASR and TTS technologies with digital humans through Unity and incorporating ChatGPT for natural language interactions. The aim was to help animation creators with tasks like script writing and sub-shot design while testing API response times and animation interactions.
The research drew on insights from various communities and platforms to deepen the understanding of how these technologies can be applied. Integrating ChatGPT into a system that combines multiple technologies proved advantageous for animation creation, although its responses were not always flawless, so expectations must be balanced when assessing its outputs. Future work will focus on optimizing the system to improve ASR and TTS stability, making it a more valuable creative assistant, and on extending its use to applications such as virtual tour guides and non-playable characters in games.