In an article recently published in the journal Scientific Reports, researchers proposed a new sentence-level lip-to-speech (LTS) synthesis architecture, designated the flash attention generative adversarial network (FA-GAN), and investigated its effectiveness in the Chinese LTS synthesis domain.
Background
LTS generation is a rapidly emerging technology that offers a new means of communication for deaf or speech-impaired individuals and plays a crucial role in education. LTS generation can be used to improve learners' oral expression and speech articulation, as well as speech interaction in robots and virtual assistants. However, the field faces several challenges, including the poor recognition accuracy of Chinese LTS generation.
Additionally, speech varies extensively across speakers and is often poorly aligned with lip movements, which is another major challenge. In practical applications, these challenges can reduce lip-reading accuracy, particularly where lexical or syllabic meaning is sensitive to small differences. The synthesized speech can mismatch the intended context because of brief lip movements, missing context, homophones, and noise, resulting in distorted speech synthesis.
Ensuring accuracy and quality in speech synthesis requires more efficient methods for processing long video sequences and for handling the quality of image representations. Overcoming these challenges is necessary for improving communication abilities, providing a better quality of life for people with disabilities, and advancing LTS technology.
Existing LTS generation techniques typically use the generative adversarial network (GAN) architecture. However, a major issue in these GAN-based LTS techniques is insufficient joint modeling of global and local lip movements, which results in inadequate image representations and visual ambiguities.
The proposed approach
In this study, researchers designed and introduced the FA-GAN deep architecture to effectively address existing challenges in the Chinese LTS generation domain. In this architecture, the audio and visual modalities were encoded separately, and global and local lip movements were jointly modeled to enhance speech recognition accuracy.
The joint modeling approach enables the model to better comprehend the relationship between audio and lip movements and to obtain a richer visual context, which enhances speech synthesis quality. This is particularly helpful when lip movements are inconsistent or blurry, resulting in higher accuracy in predicting the correct pronunciation.
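As a rough illustration of this design, the sketch below encodes the visual stream with separate global (sequence-level, whole-mouth) and local (per-frame detail) branches before fusing them into joint lip features. The module choices, layer sizes, and names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' code) of a visual encoder that models
# global lip motion and local lip detail separately, then fuses them.
import torch
import torch.nn as nn

class GlobalLocalVisualEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Global branch: 3D convolution over the full lip-crop sequence
        # captures coarse, sequence-level mouth movement.
        self.global_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # Local branch: 2D convolution applied per frame captures fine-grained
        # articulator detail (lip corners, teeth, tongue visibility).
        self.local_branch = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fuse = nn.Linear(64 + 64, dim)

    def forward(self, frames):                            # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        g = self.global_branch(frames.transpose(1, 2))    # (B, 64, T, 1, 1)
        g = g.squeeze(-1).squeeze(-1).transpose(1, 2)     # (B, T, 64)
        l = self.local_branch(frames.reshape(b * t, c, h, w))
        l = l.reshape(b, t, -1)                           # (B, T, 64)
        return self.fuse(torch.cat([g, l], dim=-1))       # (B, T, dim) joint lip features
```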
A multilevel Swin-transformer and a hierarchical iterative generator were introduced for improved image representation/visual feature extraction and speech generation, respectively. Specifically, the hierarchical iterative generator refines speech generation by focusing on variations and features at different stages of the audio, significantly increasing recognition rates and generating speech that closely resembles real speech.
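A minimal sketch of the coarse-to-fine idea behind a hierarchical iterative generator is shown below: an initial mel-spectrogram estimate is repeatedly refined while conditioning on the visual features. This is a simplified assumption about the general mechanism, not the paper's exact generator.

```python
# Illustrative-only sketch of iterative, coarse-to-fine mel generation.
import torch
import torch.nn as nn

class IterativeMelGenerator(nn.Module):
    def __init__(self, vis_dim=256, n_mels=80, n_iters=3):
        super().__init__()
        self.n_iters = n_iters
        self.initial = nn.Linear(vis_dim, n_mels)            # coarse first estimate
        self.refine = nn.GRU(vis_dim + n_mels, 128, batch_first=True)
        self.to_mel = nn.Linear(128, n_mels)

    def forward(self, vis_feats):                            # (B, T, vis_dim)
        mel = self.initial(vis_feats)                        # coarse mel (B, T, n_mels)
        outputs = [mel]
        for _ in range(self.n_iters):
            h, _ = self.refine(torch.cat([vis_feats, mel], dim=-1))
            mel = mel + self.to_mel(h)                       # residual refinement per stage
            outputs.append(mel)
        return outputs                                       # coarse-to-fine mel estimates
```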
An FA mechanism was also incorporated to enhance computational efficiency, which improved model performance and reduced the computational burden arising from the iterative generator's repeated interactions. This FA mechanism automatically learns the weights of each modality (image and audio) to facilitate better interaction and information transfer between them.
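For readers unfamiliar with flash attention, the snippet below shows the general idea using PyTorch's fused scaled_dot_product_attention, which can dispatch to a FlashAttention kernel on supported hardware. The cross-modal query/key/value layout and tensor sizes are illustrative assumptions rather than the paper's configuration.

```python
# Cross-modal attention: audio queries attend to lip-motion keys/values.
import torch
import torch.nn.functional as F

B, heads, T_a, T_v, d = 2, 4, 100, 75, 64
audio_q = torch.randn(B, heads, T_a, d)   # queries from the audio branch
visual_k = torch.randn(B, heads, T_v, d)  # keys from lip-motion features
visual_v = torch.randn(B, heads, T_v, d)  # values from lip-motion features

# Fused attention avoids materialising the full (T_a x T_v) attention matrix,
# which is what keeps repeated generator interactions computationally cheap.
fused = F.scaled_dot_product_attention(audio_q, visual_k, visual_v)
print(fused.shape)  # torch.Size([2, 4, 100, 64])
```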
Thus, the approach effectively models the temporal relationship between images and audio. The Mel spectrogram was treated as an image, and a two-dimensional (2D) GAN was employed to train the model to efficiently handle the fusion and alignment between video and audio for realistic speech generation.
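The snippet below sketches the "mel spectrogram as an image" step: a waveform is converted into a 2D time-frequency representation that an image-style 2D network can consume. The file name and transform parameters are placeholders and common defaults, not the paper's settings.

```python
# Turn speech into a 2D "image" (mel spectrogram) for a 2D GAN discriminator.
import torch
import torchaudio

waveform, sr = torchaudio.load("speech.wav")     # hypothetical input file
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
)(waveform)                                      # (channels, 80, frames)

# Treat the spectrogram as a one-channel image batch.
mel_image = mel.mean(dim=0, keepdim=True).unsqueeze(0)   # (1, 1, 80, frames)
print(mel_image.shape)
```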
Evaluation and findings
Researchers evaluated the proposed FA-GAN on several datasets, including CN-CVS and the English GRID audiovisual corpus. CN-CVS is currently the largest available multimodal Chinese dataset, with over 200,000 data entries, 2,500 speakers, and an overall duration exceeding 300 hours. In this study, only single-speaker data were utilized, and the model was assessed under four distinct environmental conditions.
The GRID dataset contains audio and video recordings of speakers exhibiting different speech patterns and mouth shapes while articulating phrases and words. The dataset provides comprehensive multimodal annotations, including speech transcripts for the audio and lip positions in the video.
Several evaluation metrics, including word error rate (WER), perceptual evaluation of speech quality (PESQ), extended short-time objective intelligibility (ESTOI), and short-time objective intelligibility (STOI), were used for the comparative experiments.
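These metrics are commonly computed with open-source tooling; the sketch below uses jiwer for WER, the pesq package for PESQ, and pystoi for STOI/ESTOI. The texts, file names, and sampling rate are placeholders, and this is not the paper's evaluation script.

```python
# Hedged example of computing WER, PESQ, STOI, and ESTOI with standard packages.
import soundfile as sf
from jiwer import wer
from pesq import pesq
from pystoi import stoi

reference_text = "place blue at f two now"      # ground-truth transcript (placeholder)
hypothesis_text = "place blue at f too now"     # recognized synthesized speech (placeholder)
print("WER:", wer(reference_text, hypothesis_text))

ref, sr = sf.read("reference.wav")              # clean ground-truth speech
deg, _ = sf.read("generated.wav")               # synthesized speech, same length and rate
print("PESQ:", pesq(sr, ref, deg, "wb"))        # wide-band PESQ (expects 16 kHz audio)
print("STOI:", stoi(ref, deg, sr, extended=False))
print("ESTOI:", stoi(ref, deg, sr, extended=True))
```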
On the English GRID dataset, the proposed FA-GAN framework outperformed the other models on the ESTOI metric, achieving the highest score. However, the Lip2Wav model performed slightly better than the proposed model on the STOI metric, scoring 0.731 versus FA-GAN's 0.724.
Similarly, the FA-GAN model demonstrated the second-best performance on the WER and PESQ metrics on the GRID dataset, slightly lagging behind the best model, VCA-GAN, on both. Specifically, VCA-GAN achieved the lowest WER of 12.25%, followed by FA-GAN with a WER of 12.67%.
On the Chinese CN-CVS dataset, the FA-GAN model outperformed all models evaluated in this study, including VCA-GAN and the VAE-based and GAN-based models, on all metrics. For instance, the proposed model achieved the lowest WER of 43.19% on CN-CVS, significantly lower than the second-lowest WER of 49.70%, achieved by VCA-GAN.
To summarize, the findings of this study demonstrated that the proposed FA-GAN architecture outperforms current Mandarin Chinese sentence-level LTS synthesis frameworks on the STOI and ESTOI metrics and existing English sentence-level LTS synthesis frameworks on the ESTOI metric.
Journal reference:
- Yang, Q., Bai, Y., Liu, F., Zhang, W. (2024). Integrated visual transformer and flash attention for lip-to-speech generation GAN. Scientific Reports, 14(1), 1-12. https://doi.org/10.1038/s41598-024-55248-6, https://www.nature.com/articles/s41598-024-55248-6