In an article recently published in the journal Scientific Reports, researchers proposed a new sentence-level lip-to-speech (LTS) synthesis architecture, designated the flash attention generative adversarial network (FA-GAN), and investigated its effectiveness in the Chinese LTS synthesis domain.
Background
LTS generation is a rapidly emerging technology that offers a new means of communication for deaf or speech-impaired individuals and plays a crucial role in education. LTS generation can be used to improve learners' oral expression and speech articulation, as well as speech interaction in robots and virtual assistants. However, the field faces several challenges, including the poor recognition accuracy of Chinese LTS generation.
Additionally, speech varies extensively across speakers and is often poorly aligned with lip movements, which is another major challenge. In practical applications, these challenges can reduce lip-reading accuracy, particularly where lexical or syllabic meaning is sensitive to small differences. The synthesized speech can mismatch the intended context because of brief lip movements, missing context, homophones, and noise, resulting in distorted speech synthesis.
Ensuring accuracy and quality in speech synthesis requires more efficient methods for processing long video sequences and for handling the quality of image representations. Overcoming these challenges is necessary for improving communication abilities, providing a better quality of life for people with disabilities, and advancing LTS technology.
Existing LTS generation techniques typically use the generative adversarial network (GAN) architecture. However, a major issue in these GAN-based LTS techniques is insufficient joint modeling of global and local lip movements, which results in inadequate image representations and visual ambiguities.
The proposed approach
In this study, researchers designed and introduced the FA-GAN deep architecture to effectively address existing challenges in the Chinese LTS generation domain. In this architecture, the audio and visual modalities were encoded separately, and global and local lip movements were jointly modeled to enhance speech recognition accuracy.
The joint modeling approach enables the model to better comprehend the relationship between audio and lip movements and to obtain a richer visual context, which enhances speech synthesis quality. This is particularly helpful when lip movements are inconsistent or blurry, resulting in higher accuracy in predicting the correct pronunciation.
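As a rough illustration of this design, the sketch below encodes the visual stream with separate global (sequence-level, whole-mouth) and local (per-frame detail) branches before fusing them into joint lip features. The module choices, layer sizes, and names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of a visual encoder that models
# global lip motion and local lip detail separately, then fuses them.
import torch
import torch.nn as nn

class GlobalLocalVisualEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Global branch: 3D convolution over the full lip-crop sequence
        # captures coarse, sequence-level mouth movement.
        self.global_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # Local branch: 2D convolution applied per frame captures fine-grained
        # articulator detail (lip corners, teeth, tongue visibility).
        self.local_branch = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fuse = nn.Linear(64 + 64, dim)

    def forward(self, frames):                            # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        g = self.global_branch(frames.transpose(1, 2))    # (B, 64, T, 1, 1)
        g = g.squeeze(-1).squeeze(-1).transpose(1, 2)     # (B, T, 64)
        l = self.local_branch(frames.reshape(b * t, c, h, w))
        l = l.reshape(b, t, -1)                           # (B, T, 64)
        return self.fuse(torch.cat([g, l], dim=-1))       # (B, T, dim) joint lip features
```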
A multilevel Swin-transformer and a hierarchical iterative generator were introduced for improved image representation/visual feature extraction and speech generation, respectively. Specifically, the hierarchical iterative generator refines speech generation by focusing on variations and features at different stages of the audio, significantly increasing recognition rates and generating speech that closely resembles real speech.
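A minimal sketch of the coarse-to-fine idea behind a hierarchical iterative generator is shown below: an initial mel-spectrogram estimate is repeatedly refined while conditioning on the visual features. This is a simplified assumption about the general mechanism, not the paper's exact generator.

```python
# Illustrative-only sketch of iterative, coarse-to-fine mel generation.
import torch
import torch.nn as nn

class IterativeMelGenerator(nn.Module):
    def __init__(self, vis_dim=256, n_mels=80, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.initial = nn.Linear(vis_dim, n_mels)            # coarse first estimate
        self.refine = nn.GRU(vis_dim + n_mels, 128, batch_first=True)
        self.to_mel = nn.Linear(128, n_mels)

    def forward(self, vis_feats):                            # (B, T, vis_dim)
        mel = self.initial(vis_feats)                        # coarse mel (B, T, n_mels)
        outputs = [mel]
        for _ in range(self.n_iters):
            h, _ = self.refine(torch.cat([vis_feats, mel], dim=-1))
            mel = mel + self.to_mel(h)                       # residual refinement per stage
            outputs.append(mel)
        return outputs                                       # coarse-to-fine mel estimates
```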
An FA mechanism was also incorporated to enhance computational efficiency, which improved model performance and reduced the computational burden arising from the iterative generator's repeated interactions. This FA mechanism automatically learns the weights of each modality (image and audio) to facilitate better interaction and information transfer between them.
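For readers unfamiliar with flash attention, the snippet below shows the general idea using PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported hardware. The cross-modal query/key/value layout and tensor sizes are illustrative assumptions rather than the paper's configuration.

```python
# Cross-modal attention: audio queries attend to lip-motion keys/values.
import torch
import torch.nn.functional as F

B, heads, T_a, T_v, d = 2, 4, 100, 75, 64
audio_q = torch.randn(B, heads, T_a, d)   # queries from the audio branch
visual_k = torch.randn(B, heads, T_v, d)  # keys from lip-motion features
visual_v = torch.randn(B, heads, T_v, d)  # values from lip-motion features

# Fused attention avoids materialising the full (T_a x T_v) attention matrix,
# which is what keeps repeated generator interactions computationally cheap.
fused = F.scaled_dot_product_attention(audio_q, visual_k, visual_v)
print(fused.shape)  # torch.Size([2, 4, 100, 64])
```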
Thus, the approach effectively models the temporal relationship between images and audio. The Mel spectrogram was treated as an image, and a two-dimensional (2D) GAN was employed to train the model to efficiently handle the fusion and alignment between video and audio for realistic speech generation.
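The snippet below sketches the "mel spectrogram as an image" step: a waveform is converted into a 2D time-frequency representation that an image-style 2D network can consume. The file name and transform parameters are placeholders and common defaults, not the paper's settings.

```python
# Turn speech into a 2D "image" (mel spectrogram) for a 2D GAN discriminator.
import torch
import torchaudio

waveform, sr = torchaudio.load("speech.wav")     # hypothetical input file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)                                      # (channels, 80, frames)

# Treat the spectrogram as a one-channel image batch.
mel_image = mel.mean(dim=0, keepdim=True).unsqueeze(0)   # (1, 1, 80, frames)
print(mel_image.shape)
```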
Evaluation and findings
Researchers evaluated the proposed FA-GAN on several datasets, including CN-CVS and the English GRID audiovisual corpus. CN-CVS is currently the largest available multimodal Chinese dataset, with over 200,000 data entries, 2,500 speakers, and an overall duration exceeding 300 hours. In this study, only single-speaker data were utilized, and the model was assessed under four distinct environmental conditions.
The GRID dataset contains audio and video recordings of speakers exhibiting different speech patterns and mouth shapes while articulating phrases and words. The dataset provides comprehensive multimodal annotations, including speech transcripts for the audio and lip positions in the video.
Several evaluation metrics, including word error rate (WER), perceptual evaluation of speech quality (PESQ), extended short-time objective intelligibility (ESTOI), and short-time objective intelligibility (STOI), were used for the comparative experiments.
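These metrics are commonly computed with open-source tooling; the sketch below uses jiwer for WER, the pesq package for PESQ, and pystoi for STOI/ESTOI. The texts, file names, and sampling rate are placeholders, and this is not the paper's evaluation script.

```python
# Hedged example of computing WER, PESQ, STOI, and ESTOI with standard packages.
import soundfile as sf
from jiwer import wer
from pesq import pesq
from pystoi import stoi

reference_text = "place blue at f two now"      # ground-truth transcript (placeholder)
hypothesis_text = "place blue at f too now"     # recognized synthesized speech (placeholder)
print("WER:", wer(reference_text, hypothesis_text))

ref, sr = sf.read("reference.wav")              # clean ground-truth speech
deg, _ = sf.read("generated.wav")               # synthesized speech, same length and rate
print("PESQ:", pesq(sr, ref, deg, "wb"))        # wide-band PESQ (expects 16 kHz audio)
print("STOI:", stoi(ref, deg, sr, extended=False))
print("ESTOI:", stoi(ref, deg, sr, extended=True))
```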
On the English GRID dataset, the proposed FA-GAN framework outperformed the other models on the ESTOI metric, achieving the highest score. However, the Lip2Wav model performed slightly better than the proposed model on the STOI metric, scoring 0.731 versus FA-GAN's 0.724.
Similarly, the FA-GAN model demonstrated the second-best performance on the WER and PESQ metrics on the GRID dataset, slightly lagging behind the best model, VCA-GAN, on both. Specifically, VCA-GAN achieved the lowest WER of 12.25%, followed by FA-GAN with a WER of 12.67%.
On the Chinese CN-CVS dataset, the FA-GAN model outperformed all models evaluated in this study, including VCA-GAN and the VAE-based and GAN-based models, on all metrics. For instance, the proposed model achieved the lowest WER of 43.19% on CN-CVS, significantly lower than the second-lowest WER of 49.70%, achieved by VCA-GAN.
To summarize, the findings of this study demonstrated that the proposed FA-GAN architecture outperforms current Mandarin Chinese sentence-level LTS synthesis frameworks on the STOI and ESTOI metrics and existing English sentence-level LTS synthesis frameworks on the ESTOI metric.
Journal reference:
- Yang, Q., Bai, Y., Liu, F., Zhang, W. (2024). Integrated visual transformer and flash attention for lip-to-speech generation GAN. Scientific Reports, 14(1), 1-12. https://doi.org/10.1038/s41598-024-55248-6, https://www.nature.com/articles/s41598-024-55248-6