In a paper published in the journal Displays, researchers present a method for generating talking face videos synchronized with driving speech audio, built on generative adversarial networks (GANs) and time-frequency features extracted from the audio. Their system includes an audio time series encoder that applies a Multi-level Wavelet Transform (MWT) to decompose speech audio into different frequency sub-bands, enhancing the realism of the generated video frames.
They also incorporate a smooth dynamic time-warping formulation into their time series discriminator, which improves synchronization. Experiments on several benchmark datasets show substantial performance gains, demonstrating that the method produces talking faces that are both higher in fidelity and more tightly synchronized with the audio.
Context and Previous Research
The significance of visual signals in communication, particularly for those with hearing disabilities and in noisy settings, underscores the need for accurate synchronization between speech audio and generated video frames. This growing field aims to enhance synchronization for various applications, such as education, entertainment, and photography.
Previous works have addressed the denoising of speech audio using wavelet transform, which effectively distinguishes signal from noise. Time-frequency analysis methods, including MWT, have been explored to improve the accuracy of speech signal analysis. Prior research has mapped speech features directly to video frames in talking face generation, but challenges persist in achieving precise synchronization between audio and video elements.
TF2 Method Components Overview
The paper then details the components of TF2, the proposed method for generating synchronized talking faces driven by audio input. The discussion begins with dynamic threshold wavelet denoising, which mitigates noise in the audio signal so that speech features can be extracted accurately; the threshold adapts to different signal-to-noise ratio (SNR) conditions to reduce the influence of noise.
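A minimal sketch of how multi-level wavelet-threshold denoising of an audio clip can look in Python, using PyWavelets; the universal threshold rule below is a common default and stands in for the paper's SNR-adaptive threshold, which is not reproduced here.

```python
import numpy as np
import pywt

def wavelet_denoise(signal, wavelet="db4", level=3):
    """Denoise a 1-D audio signal with multi-level wavelet thresholding.

    The universal threshold sigma * sqrt(2 * log(N)) is a common choice;
    the paper's SNR-adaptive rule may differ (illustrative assumption).
    """
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    # Estimate the noise level from the finest-scale detail coefficients.
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    threshold = sigma * np.sqrt(2 * np.log(len(signal)))
    # Soft-threshold every detail band; keep the approximation band intact.
    denoised = [coeffs[0]] + [pywt.threshold(c, threshold, mode="soft")
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(signal)]
```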
The generator sub-network, the core component of TF2, consists of multiple encoders and decoders. Its audio time series encoder combines MWT, 2D convolutional blocks, and a gated recurrent unit (GRU) to extract tempo-semantic correlation features from audio clips. The audio semantic encoder captures the audio's semantic features, while the image encoder extracts visual features from real sample images. These features are fused into a single feature space, from which the decoder generates synchronized talking face video frames.
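A compact PyTorch sketch of an audio time series encoder in this spirit; the layer widths, strides, and the way wavelet sub-bands are pooled are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AudioTimeSeriesEncoder(nn.Module):
    """Sketch: wavelet sub-band features -> 2D conv blocks -> GRU."""

    def __init__(self, hidden=256):
        super().__init__()
        # 2D convolutions over the (sub-band, time) plane of the wavelet features.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=hidden, batch_first=True)

    def forward(self, x):
        # x: (batch, 1, sub_bands, time) wavelet time-frequency features
        h = self.conv(x)                     # (batch, 64, bands', time')
        h = h.mean(dim=2).permute(0, 2, 1)   # pool bands -> (batch, time', 64)
        out, _ = self.gru(h)                 # tempo-semantic correlation features
        return out
```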
The video frame decoder uses a U-Net architecture with skip connections to upsample the fused features into clear, realistic facial images of the speaker. The video frame discriminator improves the quality of the generated frames by computing losses between real and rendered frames under the Least Squares GAN (LSGAN) framework, which sharpens image clarity.
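The LSGAN objective replaces the usual sigmoid cross-entropy with a least-squares penalty on discriminator outputs. A minimal sketch of the standard LSGAN losses (generic form; the paper's exact weighting is not shown):

```python
import torch

def lsgan_d_loss(d_real, d_fake):
    # Discriminator pushes real outputs toward 1 and fake outputs toward 0.
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(d_fake):
    # Generator pushes discriminator outputs on generated frames toward 1.
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```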
To ensure sequential accuracy, the time series discriminator, based on a conditional GAN (CGAN), compares the tempo-semantic correlation features extracted by the generator with the time series of the generated video frames, leveraging soft dynamic time warping (soft-DTW) in this comparison. It ensures that the sequence of rendered frames is coherent and correctly synchronized with the audio content.
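Soft-DTW scores how well two feature sequences align while remaining smooth, which is what makes it usable inside a learned discriminator. Below is a plain NumPy sketch of the standard soft-DTW recursion (Cuturi and Blondel's formulation), not the paper's implementation:

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW cost between feature sequences x: (n, d) and y: (m, d)."""
    n, m = len(x), len(y)
    # Pairwise squared-Euclidean cost matrix between time steps.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Smoothed minimum over the three DTW predecessors.
            r = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            z = -r / gamma
            zmax = z.max()
            R[i, j] = cost[i - 1, j - 1] - gamma * (
                zmax + np.log(np.exp(z - zmax).sum()))
    return R[n, m]
```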
The objective functions in TF2 target different aspects of video generation: the LSGAN loss enhances the clarity and authenticity of generated frames, the CGAN loss enforces a realistic frame order and synchronizes the audio with the video, and a mean squared error (MSE) term reduces the differences between generated and real frames, improving the overall quality of the talking face video.
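One way these terms could be combined into a single generator objective is sketched below; the weighting coefficients are placeholders, and the exact balance used in the paper is not reproduced here.

```python
import torch
import torch.nn.functional as F

def generator_total_loss(d_frame_fake, d_seq_fake, fake_frames, real_frames,
                         lambda_seq=1.0, lambda_mse=10.0):
    """Sketch of a combined generator objective: frame adversarial (LSGAN) +
    sequence adversarial (conditional) + MSE reconstruction.
    The lambda weights are illustrative placeholders."""
    loss_frame = 0.5 * ((d_frame_fake - 1.0) ** 2).mean()        # LSGAN term
    loss_seq = F.binary_cross_entropy_with_logits(               # CGAN term
        d_seq_fake, torch.ones_like(d_seq_fake))
    loss_mse = F.mse_loss(fake_frames, real_frames)              # reconstruction
    return loss_frame + lambda_seq * loss_seq + lambda_mse * loss_mse
```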
Findings
In quantitative results, the researchers evaluate TF2 on benchmark datasets, including Lip Reading in the Wild (LRW), VoxCeleb2, and the GRID audiovisual corpus. TF2 achieves higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) values, lower landmark distance (LMD), and higher synchronization confidence (Sync_conf), indicating that it generates clear, realistic talking face videos with precisely synchronized mouth movements.
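For reference, PSNR and SSIM can be computed per frame with standard scikit-image routines, as in the sketch below; the paper's exact evaluation protocol (cropping, resolution, averaging) may differ.

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(real, fake):
    """PSNR and SSIM between a real and a generated frame (uint8, HxWx3)."""
    psnr = peak_signal_noise_ratio(real, fake, data_range=255)
    ssim = structural_similarity(real, fake, channel_axis=-1, data_range=255)
    return psnr, ssim
```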
In qualitative results, an experiment on the LRW dataset illustrates TF2's performance across different training durations. This visual analysis shows initially blurry video frames becoming more authentic and higher quality, highlighting the effect of prolonged training. The researchers also examine TF2's generalization ability, demonstrating that it can generate clear video frames with matching mouth shapes for the same utterance across different subjects and themes, showcasing its adaptability across various scenarios.
The Ablation Study analyzes crucial components of TF2, including the impact of audio denoising, the effectiveness of different wavelet transformation levels, and the significance of various loss functions. These experiments confirm that denoising audio enhances video quality, that the three-level wavelet transformation method is most effective, and that combining multiple loss functions optimizes TF2's performance.
In the User Study, volunteers evaluate the generated talking face videos on authenticity, temporal coherence, mouth movement synchronization, and video realism. TF2 outperforms competing methods, receiving higher ratings for temporal coherence, mouth movement synchronization, and video realism.
Summary
To sum up, TF2 is a GAN-based method that generates clear, synchronized talking face videos from face images and speech audio. It incorporates a speech time series encoder built on MWT and a GRU to extract tempo-semantic correlation features and to keep the generated frames in the correct order, while its time series discriminator leverages soft dynamic time warping (soft-DTW) to improve synchronization accuracy. Quantitative evaluation on the LRW, VoxCeleb2, and GRID datasets demonstrates TF2's proficiency in creating synchronized videos.