Flash Attention Generative Adversarial Network for Enhanced Lip-to-Speech Technology

In an article recently published in the journal Scientific Reports, researchers proposed a new sentence-level lip-to-speech (LTS) synthesis architecture, designated the flash attention generative adversarial network (FA-GAN), and investigated its effectiveness in the Chinese LTS synthesis domain.

Study: FA-GAN for Enhanced Lip-to-Speech Technology. Image credit: BestForBest/Shutterstock

Background

LTS generation is a rapidly evolving technology that offers a new means of communication for deaf or speech-impaired individuals and plays a crucial role in education. It can be used to improve learners' oral expression and speech articulation, as well as speech interaction in robots and virtual assistants. However, the field faces several challenges, including the poor recognition accuracy of Chinese LTS generation.

Additionally, the extensive variation in speech is often poorly aligned with lip movements, which is another major challenge. In practical applications, these challenges can reduce lip-reading accuracy, particularly where lexical or syllabic meaning is sensitive to small differences. The synthesized speech can also mismatch the intended context due to brief lip movements, missing context, homophones, and noise, resulting in distorted speech synthesis.

In speech synthesis, accuracy and quality depend on more efficient methods for processing long video sequences and on high-quality image representations. Overcoming these challenges is necessary to improve communication abilities, provide a better quality of life for people with disabilities, and advance LTS technology.

Existing LTS generation techniques typically use the generative adversarial network (GAN) architecture. However, a major issue in these GAN-based LTS techniques is insufficient joint modeling of global and local lip movements, which results in inadequate image representations and visual ambiguities.

The proposed approach

In this study, researchers designed and introduced the FA-GAN deep architecture to address the existing challenges in the Chinese LTS generation domain. In this architecture, the audio and visual inputs were encoded separately, and global and local lip movements were jointly modeled to enhance speech recognition accuracy.

The joint modeling approach enables the model to better comprehend the relationship between audio and lip movements and to obtain a richer visual context, which enhances speech synthesis quality. This is particularly useful for interpreting inconsistent or blurry lip movements, resulting in more accurate predictions of the correct pronunciation.
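
To make the idea concrete, the following is a minimal PyTorch sketch of how global (full-face) and local (mouth-region) visual features could be encoded separately and fused into a joint representation. The module names, layer sizes, and fusion strategy are illustrative assumptions for exposition, not the authors' actual FA-GAN encoder.

```python
# Minimal PyTorch sketch of joint global/local lip-movement modeling.
# Module names, dimensions, and the fusion strategy are illustrative
# assumptions; the paper's actual FA-GAN encoder may differ.
import torch
import torch.nn as nn


class JointLipEncoder(nn.Module):
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Global branch: encodes the full face crop of each frame.
        self.global_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2)),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep the time axis, pool space
        )
        # Local branch: encodes a tighter mouth-region crop.
        self.local_branch = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=(3, 3, 3), stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )
        # Fuse per-frame global and local features into one visual context.
        self.fuse = nn.Linear(128, feat_dim)

    def forward(self, face: torch.Tensor, mouth: torch.Tensor) -> torch.Tensor:
        # face, mouth: (batch, channels, time, height, width)
        g = self.global_branch(face).flatten(2).transpose(1, 2)   # (B, T, 64)
        l = self.local_branch(mouth).flatten(2).transpose(1, 2)   # (B, T, 64)
        return self.fuse(torch.cat([g, l], dim=-1))               # (B, T, feat_dim)


# Usage with dummy data: 16 frames of 96x96 face crops and 48x48 mouth crops.
encoder = JointLipEncoder()
face = torch.randn(2, 3, 16, 96, 96)
mouth = torch.randn(2, 3, 16, 48, 48)
print(encoder(face, mouth).shape)  # torch.Size([2, 16, 256])
```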

A multilevel Swin-transformer and a hierarchical iterative generator were introduced for improved image representation/visual feature extraction and speech generation, respectively. Specifically, the hierarchical iterative generator refined speech generation by focusing on the variations and features of different audio stages, significantly increasing recognition rates and generating speech that closely resembles real speech.
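
Below is a hedged sketch of what a hierarchical, iterative generator of this kind could look like: a coarse mel-spectrogram estimate is produced first and then refined over several passes conditioned on the visual features. The structure, module names, and dimensions are assumptions for illustration, not the paper's exact generator.

```python
# Illustrative sketch of a hierarchical, iterative mel-spectrogram generator:
# a coarse mel estimate is produced first and then refined over several
# iterations. The structure is an assumption for exposition, not the paper's
# exact generator.
import torch
import torch.nn as nn


class IterativeMelGenerator(nn.Module):
    def __init__(self, feat_dim: int = 256, n_mels: int = 80, n_iters: int = 3):
        super().__init__()
        self.n_iters = n_iters
        # Coarse stage maps visual features to an initial mel estimate.
        self.coarse = nn.GRU(feat_dim, n_mels, batch_first=True)
        # Refinement stage repeatedly corrects the current estimate,
        # conditioned on the visual features.
        self.refine = nn.GRU(feat_dim + n_mels, n_mels, batch_first=True)

    def forward(self, visual: torch.Tensor) -> torch.Tensor:
        # visual: (batch, time, feat_dim)
        mel, _ = self.coarse(visual)                       # (B, T, n_mels)
        for _ in range(self.n_iters):
            residual, _ = self.refine(torch.cat([visual, mel], dim=-1))
            mel = mel + residual                           # iterative refinement
        return mel


generator = IterativeMelGenerator()
mel = generator(torch.randn(2, 16, 256))
print(mel.shape)  # torch.Size([2, 16, 80])
```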

Additionally, a flash attention (FA) mechanism was incorporated to enhance computational efficiency, which augmented model performance and reduced the computational burden arising from the several iterations of the iterative generator. This FA mechanism automatically learns the weights of each modality (image and audio) to facilitate better interaction and information transfer between them.
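
As a point of reference, PyTorch's scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel when supported hardware and data types are available; the snippet below sketches memory-efficient cross-modal attention in that spirit, with audio queries attending to visual keys and values. The tensor shapes and attention direction are assumptions for illustration, not the paper's implementation.

```python
# Sketch of memory-efficient cross-modal attention using PyTorch's
# scaled_dot_product_attention, which can dispatch to a FlashAttention
# kernel when supported hardware/dtypes are available. Shapes and the
# cross-attention direction (audio queries attending to visual keys/values)
# are assumptions for illustration.
import torch
import torch.nn.functional as F

batch, heads, head_dim = 2, 4, 64
audio_len, video_len = 200, 16

# (batch, heads, sequence_length, head_dim)
q = torch.randn(batch, heads, audio_len, head_dim)   # audio queries
k = torch.randn(batch, heads, video_len, head_dim)   # visual keys
v = torch.randn(batch, heads, video_len, head_dim)   # visual values

# Fused attention; avoids materializing the full attention matrix
# when a flash/memory-efficient backend is selected.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4, 200, 64])
```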

Thus, the approach effectively models the temporal relationship between images and audio. The Mel spectrogram was treated as an image, and a two-dimensional (2D) GAN was employed to train the model to efficiently handle the fusion and alignment between video and audio for realistic speech generation.
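
The snippet below illustrates this idea under stated assumptions: a mel spectrogram computed with torchaudio is treated as a single-channel image and scored by a small 2D convolutional discriminator. The layer sizes and transform parameters are illustrative, not the paper's architecture.

```python
# Sketch of treating a mel spectrogram as a single-channel image and scoring
# it with a small 2D convolutional discriminator, mirroring the article's
# description of a 2D GAN over mel spectrograms. Layer sizes are illustrative.
import torch
import torch.nn as nn
import torchaudio

# Mel spectrogram: waveform -> (n_mels, time) "image".
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)
waveform = torch.randn(1, 16000)                 # 1 second of dummy audio
mel = mel_transform(waveform).unsqueeze(1)       # (batch, 1, n_mels, frames)

# A minimal 2D discriminator over the mel "image".
discriminator = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 1),                            # real/fake score
)
print(discriminator(mel).shape)                  # torch.Size([1, 1])
```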

Evaluation and findings

Researchers evaluated the proposed FA-GAN on several datasets, including CN-CVS and the GRID audiovisual corpus. CN-CVS is currently the largest available multimodal Chinese dataset, with over 200,000 data entries, more than 2,500 speakers, and an overall duration exceeding 300 hours. In this study, only single-speaker data were utilized, and the model was assessed under four distinct environmental conditions.

The English GRID dataset contains audio and video recordings of multiple speakers, capturing their different speech patterns and mouth shapes while articulating words and phrases. The dataset provides comprehensive multimodal annotations, including speech transcripts for the audio and lip positions in the video.

Several evaluation metrics, including word error rate (WER), perceptual evaluation of speech quality (PESQ), extended short-time objective intelligibility (ESTOI), and short-time objective intelligibility (STOI), were used for the comparative experiments.
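
For reference, each of these metrics has a widely used open-source implementation. The sketch below assumes the jiwer, pesq, and pystoi Python packages and uses dummy signals; these are not necessarily the exact tools used in the study.

```python
# Common open-source implementations of the metrics named above, assuming the
# jiwer, pesq, and pystoi packages are installed; these are not necessarily
# the exact tools the authors used.
import numpy as np
from jiwer import wer            # word error rate
from pesq import pesq            # perceptual evaluation of speech quality
from pystoi import stoi          # (extended) short-time objective intelligibility

sr = 16000
reference = np.random.randn(sr * 3)                      # 3 s of dummy audio
synthesized = reference + 0.05 * np.random.randn(sr * 3)  # noisy "synthesis"

print("WER:  ", wer("place blue at a one now", "place blue at a one now"))
print("PESQ: ", pesq(sr, reference, synthesized, "wb"))           # wideband mode
print("STOI: ", stoi(reference, synthesized, sr, extended=False))
print("ESTOI:", stoi(reference, synthesized, sr, extended=True))
```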

On the English GRID dataset, the proposed FA-GAN framework achieved the highest score on the ESTOI metric, outperforming the other models. However, the Lip2Wav model performed slightly better than the proposed model on the STOI metric, scoring 0.731 versus 0.724 for FA-GAN.

Similarly, the FA-GAN model demonstrated the second-best performance based on the WER and PESQ metrics on the GRID dataset, slightly lagging behind the best model, VCA-GAN, on both metrics. Specifically, the VCA-GAN achieved the lowest WER of 12.25%, followed by the FA-GAN with 12.67% WER.

On the CN-CVS Chinese dataset, the FA-GAN model outperformed all models evaluated in this study, including VCA-GAN, VAE-based, and GAN-based models, on all metrics. For instance, the proposed model achieved the lowest WER of 43.19% on CN-CVS, significantly lower than the second-lowest WER of 49.70% achieved by VCA-GAN.

To summarize, the findings of this study demonstrated that the proposed FA-GAN architecture outperforms current Mandarin Chinese sentence-level LTS synthesis frameworks on the STOI and ESTOI metrics and existing English sentence-level LTS synthesis frameworks on the ESTOI metric.


