Enhancing Audio-Visual Speech Recognition with Cross-Modal Fusion

In an article recently submitted to the arXiv* preprint server, researchers proposed a new approach to improve audio-visual speech recognition (AVSR) systems using a cross-modal fusion encoder (CMFE) and visual pre-training based on lip-subword correlation.

Study: Enhancing Audio-Visual Speech Recognition with Cross-Modal Fusion. Image credit: panuwat phimpha/Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Background

AVSR is a multi-modal application based on the bi-modal, auditory and visual, nature of speech perception during human communication. In this application, lip movement is used as a complementary modality to enhance automatic speech recognition (ASR) performance.

Previously, handcrafted lip features were extracted and incorporated into hybrid ASR systems. However, end-to-end AVSR (E2E-AVSR) systems have received significant attention in recent years owing to their design simplicity and the wide availability of public audio-visual databases. Although E2E-AVSR systems have demonstrated their effectiveness in several respects, several issues must still be addressed to make them feasible for practical applications.

For instance, under a common end-to-end framework, far-field low-quality videos do not significantly improve AVSR performance over audio-only ASR systems. Additionally, performance can even degrade when moving from a uni-modal network to a multi-modal network.

Learning a large integrated neural network in a multi-modal model is more challenging than in a uni-modal model because the two modalities, audio and visual, have specialized input representations and different convergence rates. Pre-training techniques can be used to decouple the end-to-end one-pass training framework into two stages: pre-training the uni-modal networks and then integrating them into a fusion model through unified fine-tuning.
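To make the two-stage idea concrete, the following minimal PyTorch sketch pre-trains two stand-in uni-modal encoders on frame-level labels and then fine-tunes a simple fusion model built from them. All module names, dimensions, and hyperparameters here are illustrative assumptions rather than the architecture used in the study.

```python
# Minimal PyTorch sketch of the decoupled two-stage strategy described above.
# All module shapes, names, and hyperparameters are illustrative assumptions,
# not the authors' actual architecture.
import torch
import torch.nn as nn

class UniModalEncoder(nn.Module):
    """Stand-in uni-modal encoder (audio or visual branch)."""
    def __init__(self, in_dim, hid_dim=256, n_classes=400):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU(),
                                      nn.Linear(hid_dim, hid_dim), nn.ReLU())
        self.head = nn.Linear(hid_dim, n_classes)  # used only during pre-training

    def forward(self, x):
        return self.backbone(x)

class FusionModel(nn.Module):
    """Fusion model initialized from the two pre-trained branches."""
    def __init__(self, audio_enc, visual_enc, hid_dim=256, n_classes=400):
        super().__init__()
        self.audio_enc, self.visual_enc = audio_enc, visual_enc
        self.fusion = nn.Linear(2 * hid_dim, hid_dim)
        self.classifier = nn.Linear(hid_dim, n_classes)

    def forward(self, audio, video):
        a, v = self.audio_enc(audio), self.visual_enc(video)
        return self.classifier(torch.relu(self.fusion(torch.cat([a, v], dim=-1))))

def pretrain(encoder, feats, labels, epochs=3):
    """Stage 1: pre-train one uni-modal branch with a frame-level classifier."""
    opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    for _ in range(epochs):
        logits = encoder.head(encoder(feats))
        loss = nn.functional.cross_entropy(logits.flatten(0, 1), labels.flatten())
        opt.zero_grad(); loss.backward(); opt.step()

# Toy data: (batch, frames, feature_dim) with frame-level class labels.
audio_feats, video_feats = torch.randn(8, 50, 80), torch.randn(8, 50, 512)
frame_labels = torch.randint(0, 400, (8, 50))

audio_enc, visual_enc = UniModalEncoder(80), UniModalEncoder(512)
pretrain(audio_enc, audio_feats, frame_labels)   # stage 1: audio branch
pretrain(visual_enc, video_feats, frame_labels)  # stage 1: visual branch

# Stage 2: integrate both branches and fine-tune the whole fusion model.
fusion = FusionModel(audio_enc, visual_enc)
opt = torch.optim.Adam(fusion.parameters(), lr=1e-4)
loss = nn.functional.cross_entropy(fusion(audio_feats, video_feats).flatten(0, 1),
                                   frame_labels.flatten())
opt.zero_grad(); loss.backward(); opt.step()
```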

Thus, this strategy can effectively address the differing learning dynamics between modalities and exploit their interactions. How to pre-train the visual front-end is a critical aspect of the decoupled training framework. Although a pre-trained visual front-end can be used directly as a frozen visual embedding extractor, this approach leads to only small improvements owing to the differing feature distributions across the source and target domains.
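As a rough illustration of the frozen-extractor alternative mentioned above, the hypothetical PyTorch snippet below contrasts freezing a pre-trained visual front-end with unfreezing it for unified fine-tuning; the stand-in modules and learning rates are assumptions, not details from the paper.

```python
# Hypothetical illustration of "frozen extractor" vs. unified fine-tuning;
# the stand-in modules and learning rates are assumptions.
import torch
import torch.nn as nn

visual_frontend = nn.Sequential(nn.Conv3d(1, 32, kernel_size=3, padding=1),
                                nn.ReLU(), nn.AdaptiveAvgPool3d(1), nn.Flatten())
fusion_head = nn.Linear(32, 400)  # stand-in downstream AVSR component

# Option A: frozen visual embedding extractor -- only downstream layers update,
# so the front-end cannot adapt to the target-domain feature distribution.
for p in visual_frontend.parameters():
    p.requires_grad = False
opt_frozen = torch.optim.Adam(fusion_head.parameters(), lr=1e-4)

# Option B: unfreeze the front-end so it is fine-tuned together with the rest.
for p in visual_frontend.parameters():
    p.requires_grad = True
opt_finetune = torch.optim.Adam(list(visual_frontend.parameters()) +
                                list(fusion_head.parameters()), lr=1e-4)
```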

In several studies, researchers pre-trained the visual front-end on an isolated word recognition task and then fine-tuned the front-end within the AVSR model. However, these approaches require recordings of isolated words paired with labels, which cannot be collected easily on a large scale.

In a recent study, self-supervised learning was leveraged for AVSR on large-scale unlabeled data sets. Although such pre-training methods slightly improve AVSR performance, they rely on large amounts of additional unlabeled and labeled data.

Improving AVSR performance using a new approach

In this paper, researchers proposed two novel techniques to improve AVSR under a pre-training and fine-tuning framework. First, they investigated the correlation between lip shapes and syllable-level subword units in Mandarin to produce good frame-level syllable boundaries from lip shapes, enabling precise alignment of the audio and video streams during visual model pre-training and cross-modal fusion.

Second, they proposed an audio-guided CMFE neural network in which the main training parameters are devoted to several cross-modal attention layers so that modality complementarity can be fully leveraged. The proposed subword-correlated visual pre-training technique did not require additional data or manually labeled word boundaries.

A set of Gaussian mixture model-based hidden Markov models (GMM-HMMs) was trained on far-field audio to create frame-level alignment labels, and the visual front-end was pre-trained by identifying the syllable-related HMM state corresponding to each visual frame.

Moreover, the fine-grained alignment labels guided the network to focus on extracting visual features from low-quality videos. Thus, the proposed pre-training method explicitly provided syllable boundaries to produce a direct frame-level mapping from lip shapes to Mandarin syllables, which differs from end-to-end continuous lip-reading-based pre-training.
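A minimal sketch of this kind of frame-level visual pre-training is shown below, assuming the frame-level HMM-state alignments have already been produced by a GMM-HMM system trained on the far-field audio. The toy front-end, the number of syllable-related states, and the input shapes are illustrative assumptions, not the study's configuration.

```python
# Sketch of frame-level visual pre-training on HMM-state targets. The
# front-end architecture and the number of syllable-related states are
# illustrative assumptions; frame alignments are assumed to come from a
# GMM-HMM system trained on the far-field audio.
import torch
import torch.nn as nn

NUM_STATES = 1000          # assumed size of the syllable-related HMM state set
lip_frames = torch.randn(4, 75, 1, 88, 88)            # (batch, T, C, H, W) lip ROIs
state_labels = torch.randint(0, NUM_STATES, (4, 75))  # frame-level alignment labels

class VisualFrontend(nn.Module):
    """Toy lip-reading front-end: per-frame 2D CNN + frame-level classifier."""
    def __init__(self, n_states):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
                                 nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(64, n_states)

    def forward(self, x):                      # x: (B, T, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1))      # (B*T, 64)
        return self.classifier(feats).view(b, t, -1)

model = VisualFrontend(NUM_STATES)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
logits = model(lip_frames)                                 # (B, T, NUM_STATES)
loss = nn.functional.cross_entropy(logits.flatten(0, 1),   # one target per frame
                                   state_labels.flatten())
loss.backward(); opt.step()
```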

The proposed method can also be considered a cross-modal conversion process that accepts video frames as inputs and generates acoustic subword sequences as outputs. Additionally, the acoustic information derived from lip movements can support efficient adaptation to the audio stream in the fusion stage.
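Viewed this way, a simple hypothetical decoding step could turn per-frame syllable posteriors into a subword sequence, for example by taking the per-frame argmax and collapsing consecutive repeats; this collapsing rule is an illustrative simplification and not the paper's decoder.

```python
# Hypothetical decoding of the cross-modal conversion view: take per-frame
# syllable posteriors from the pre-trained visual front-end, pick the argmax
# per frame, and collapse consecutive repeats into a subword sequence.
import torch

def frames_to_subwords(frame_logits: torch.Tensor) -> list[int]:
    """frame_logits: (T, num_syllable_units) -> deduplicated syllable IDs."""
    ids = frame_logits.argmax(dim=-1).tolist()
    return [s for i, s in enumerate(ids) if i == 0 or s != ids[i - 1]]

logits = torch.randn(75, 400)           # e.g., 75 video frames, 400 syllable units
print(frames_to_subwords(logits)[:10])  # first few predicted syllable IDs
```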

In the fusion stage of the decoupled training, the initialized visual and audio branches already possessed uni-modal representation extraction abilities, which allowed more training parameters to be devoted to building the modality fusion model. In the novel, audio-dominated CMFE, cross-modal fusion occurred at several layers.
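The sketch below illustrates one plausible form of such an audio-dominated fusion block, in which audio features act as queries that attend over visual features and the block is repeated at several encoder layers; the layer count, dimensions, and exact layout are assumptions and may differ from the actual CMFE.

```python
# Illustrative audio-dominated cross-modal attention block: audio features act
# as queries and attend over visual features. Layer counts and dimensions are
# assumptions; the actual CMFE design may differ.
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, audio, visual):
        # Audio self-attention keeps the audio stream dominant.
        a = self.norm1(audio + self.self_attn(audio, audio, audio)[0])
        # Cross-modal attention: audio queries, visual keys/values.
        a = self.norm2(a + self.cross_attn(a, visual, visual)[0])
        return self.norm3(a + self.ffn(a))

# Fusion applied at several layers of the encoder stack.
layers = nn.ModuleList(CrossModalFusionLayer() for _ in range(3))
audio_feats = torch.randn(2, 100, 256)   # (batch, audio frames, d_model)
visual_feats = torch.randn(2, 100, 256)  # assumed already aligned to audio frames
for layer in layers:
    audio_feats = layer(audio_feats, visual_feats)
```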

Significance of the study

The final AVSR system based on the proposed techniques outperformed state-of-the-art systems on the Multimodal Information Based Speech Processing 2021 (MISP2021)-AVSR data set without using additional training data or complex front-ends and back-ends, indicating the effectiveness of the two techniques.

The AVSR system, which used weighted prediction error and guided source separation (WPE+GSS) in the front-end and the CMFE as the back-end encoder, was initialized with 500 hours of training data for the audio branch and no extra training data for the visual branch. It attained a 24.58% character error rate (CER), outperforming the state-of-the-art NIO system by an absolute CER reduction of 0.5%.
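For reference, CER is conventionally computed as the character-level Levenshtein edit distance between the hypothesis and the reference, divided by the reference length. The minimal implementation below uses toy strings and is not the scoring pipeline used in the MISP2021 evaluation.

```python
# Minimal character error rate (CER) computation via Levenshtein edit distance.
# Toy strings only; this is the standard metric definition, not the exact
# scoring pipeline used in the MISP2021 evaluation.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # (substitutions + deletions + insertions) / reference length
    return dp[len(r)][len(h)] / max(len(r), 1)

print(f"CER: {cer('你好世界', '你号世界啊'):.2%}")  # 1 substitution + 1 insertion -> 50.00%
```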


Journal reference:

Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.

