In an article recently submitted to the arXiv* preprint server, researchers proposed a new approach to improving audio-visual speech recognition (AVSR) systems using a cross-modal fusion encoder (CMFE) and visual pre-training based on lip-subword correlation.
Background
AVSR is primarily a multi-modality application built on the bimodal (auditory and visual) nature of perception during speech communication between humans. In this application, lip movement serves as a complementary modality that enhances automatic speech recognition (ASR) performance.
Previously, handcrafted lip features were extracted and incorporated into hybrid ASR systems. End-to-end AVSR (E2E-AVSR) systems, however, have received significant attention in recent years owing to their design simplicity and the extensive availability of public audio-visual databases. Although E2E-AVSR systems have demonstrated their effectiveness in several respects, a number of issues must still be addressed to make them feasible for practical applications.
For instance, with far-field, low-quality videos, AVSR systems trained under a common end-to-end framework do not significantly outperform audio-only ASR systems; in some cases, performance even degrades when moving from a uni-modal network to a multi-modal network.
Learning a large integrated neural network in a multi-modal model is more challenging than in a uni-modal model because the audio and visual modalities have specialized input representations and different convergence rates. Pre-training techniques can be used to decouple the end-to-end one-pass training framework into two stages: the uni-modal networks are pre-trained first and then integrated into a fusion model through unified fine-tuning.
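The decoupled two-stage recipe can be pictured with a short PyTorch sketch. Everything below, including the toy encoders, the feature sizes, and the 400-token output inventory, is an illustrative assumption rather than the authors' implementation.

```python
# Minimal sketch of decoupled two-stage training (illustrative, not the authors' code).
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    def __init__(self, in_dim=80, hid=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
    def forward(self, x):            # x: (batch, frames, in_dim)
        return self.net(x)

class VisualEncoder(nn.Module):
    def __init__(self, in_dim=512, hid=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
    def forward(self, x):            # x: (batch, frames, in_dim)
        return self.net(x)

class FusionModel(nn.Module):
    """Stage 2: wrap the pre-trained uni-modal encoders and learn a joint head."""
    def __init__(self, audio_enc, visual_enc, hid=256, n_tokens=400):
        super().__init__()
        self.audio_enc, self.visual_enc = audio_enc, visual_enc
        self.head = nn.Linear(2 * hid, n_tokens)   # assumed subword inventory size
    def forward(self, audio, video):
        fused = torch.cat([self.audio_enc(audio), self.visual_enc(video)], dim=-1)
        return self.head(fused)      # (batch, frames, n_tokens)

# Stage 1: pre-train each branch separately on its own task (loops omitted here).
audio_enc, visual_enc = AudioEncoder(), VisualEncoder()

# Stage 2: integrate both branches into a fusion model and fine-tune jointly.
model = FusionModel(audio_enc, visual_enc)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
audio = torch.randn(2, 100, 80)      # toy far-field filter-bank features
video = torch.randn(2, 100, 512)     # toy lip-region embeddings
labels = torch.randint(0, 400, (2, 100))
loss = nn.CrossEntropyLoss()(model(audio, video).transpose(1, 2), labels)
loss.backward()
opt.step()
```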
This strategy can effectively address the differing learning dynamics of the two modalities while still exploiting their interactions. A critical aspect of the decoupled training framework is how the visual front-end is pre-trained. Although a pre-trained visual front-end can be used directly as a frozen visual embedding extractor, this approach yields only small improvements because the feature distributions of the source and target domains differ.
In several studies, researchers pre-trained the visual front-end on an isolated word recognition task and then fine-tuned it within the AVSR model. However, these approaches require recordings of isolated words paired with labels, which cannot be collected easily on a large scale.
In a recent study, self-supervised learning was leveraged for AVSR on large-scale unlabeled data sets. Although such pre-training methods slightly improve AVSR performance, they rely on large amounts of additional labeled and unlabeled data.
Improving AVSR performance using a new approach
In this paper, the researchers proposed two novel techniques to improve AVSR under a pre-training and fine-tuning framework. First, they investigated the correlation between Mandarin syllable-level subword units and lip shapes to derive accurate frame-level syllable boundaries from lip shapes, enabling precise alignment of the audio and video streams during visual model pre-training and cross-modal fusion.
Second, they proposed an audio-guided CMFE neural network that devotes the main training parameters to several cross-modal attention layers so that the complementarity of the two modalities is fully leveraged. The proposed subword-correlated visual pre-training technique required neither additional data nor manually labeled word boundaries.
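As a rough illustration of audio-guided cross-modal attention, the sketch below stacks a few fusion layers in which the audio stream queries the visual stream. The layer layout, dimensions, and three-layer stack are assumptions made for illustration, not the authors' CMFE architecture.

```python
# Hedged sketch of an audio-guided cross-modal fusion layer (not the actual CMFE).
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, audio, visual):
        # Audio self-attention keeps the audio stream dominant.
        a, _ = self.self_attn(audio, audio, audio)
        audio = self.norm1(audio + a)
        # Audio queries attend to visual keys/values, injecting complementary
        # lip information into the audio representation.
        c, _ = self.cross_attn(audio, visual, visual)
        audio = self.norm2(audio + c)
        return self.norm3(audio + self.ffn(audio))

# Stacking several such layers gives cross-modal fusion at multiple depths.
layers = nn.ModuleList(CrossModalFusionLayer() for _ in range(3))
audio, visual = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
for layer in layers:
    audio = layer(audio, visual)
```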
A set of Gaussian mixture model-hidden Markov models (GMM-HMMs) was trained on far-field audio to create frame-level alignment labels, and the visual front-end was pre-trained to predict the syllable-related HMM state corresponding to each visual frame.
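The frame-level pre-training objective can be summarized as a per-frame classification loss, as in the hedged sketch below. The stand-in front-end, feature sizes, and 400-state inventory are assumptions, and the alignment labels are presumed to come from GMM-HMM forced alignment.

```python
# Illustrative frame-level pre-training target for the visual front-end.
import torch
import torch.nn as nn

N_STATES = 400                      # assumed size of the syllable-related state set
frontend = nn.Sequential(           # stand-in for the visual front-end
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, N_STATES),       # per-frame posterior over HMM states
)

lip_feats = torch.randn(8, 120, 512)                  # (batch, frames, lip features)
align_labels = torch.randint(0, N_STATES, (8, 120))   # from GMM-HMM forced alignment

logits = frontend(lip_feats)                          # (batch, frames, N_STATES)
loss = nn.CrossEntropyLoss()(logits.transpose(1, 2), align_labels)
loss.backward()                     # train the front-end on frame-level targets
```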
Moreover, the fine-grained alignment labels guided the network to focus on extracting visual features from low-quality videos. The proposed pre-training method thus explicitly provides syllable boundaries and yields a direct frame-level mapping from lip shapes to Mandarin syllables, which distinguishes it from pre-training methods based on end-to-end continuous lip reading.
The proposed method can also be viewed as a cross-modal conversion process that accepts video frames as inputs and generates acoustic subword sequences as outputs. The acoustic information derived from lip movements can, in turn, support efficient adaptation to the audio stream during the fusion stage.
In the fusion stage of the decoupled training, the initialized audio and visual branches already possess uni-modal representation extraction abilities, which allows more training parameters to be devoted to building the modality fusion model. In the novel audio-dominated CMFE, cross-modal fusion occurs at several different layers.
Significance of the study
The final AVSR system built on the proposed techniques outperformed state-of-the-art systems on the Multimodal Information Based Speech Processing 2021 (MISP2021) AVSR data set without using additional training data or complex front-ends and back-ends, indicating the effectiveness of the two techniques.
The AVSR system combined weighted prediction error (WPE) and guided source separation (WPE+GSS) in the front-end with the CMFE as the back-end encoder; it was initialized using 500 hours of training data for the audio branch and no extra training data for the visual branch. This system attained a character error rate (CER) of 24.58%, outperforming the state-of-the-art NIO system by an absolute CER reduction of 0.5%.
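For reference, the reported character error rate is the character-level edit distance between the recognized and reference transcripts divided by the reference length. The small function below is a generic illustration of that computation, not code from the study.

```python
# Character error rate (CER): Levenshtein distance between hypothesis and
# reference character sequences, divided by the reference length.
def cer(reference: str, hypothesis: str) -> float:
    ref, hyp = list(reference), list(hypothesis)
    # Dynamic-programming edit distance (substitutions + deletions + insertions).
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(dist[j] + 1, dist[j - 1] + 1, prev + (r != h))
            prev, dist[j] = dist[j], cur
    return dist[-1] / max(len(ref), 1)

print(cer("speech recognition", "speech recogmition"))  # one substitution -> ~0.056
```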
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Dai, Y., Chen, H., Du, J., Ding, X., Ding, N., Jiang, F., & Lee, C. (2023). Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder. arXiv. https://doi.org/10.48550/arXiv.2308.08488, https://arxiv.org/abs/2308.08488