In an article recently posted to the Meta Research website, researchers introduced Efficient Monotonic Multihead Attention (EMMA), a cutting-edge model for simultaneous translation with stable monotonic alignment estimation. The focus was on reducing machine translation latency, which is crucial for real-time applications like international conferences.
Background
Simultaneous translation, a task aimed at minimizing machine translation system latency, is crucial for real-time applications like international conferences and personal travel. Unlike traditional offline models that process an entire input sentence before generating output, simultaneous models operate on partial input sequences. These models incorporate policies that decide when to generate translation output by choosing between two actions: reading more input or writing a translation token. Monotonic attention-based policies, particularly Transformer-based Monotonic Multihead Attention (MMA), have excelled in text-to-text translation tasks.
In simultaneous translation, the model initiates translation before the speaker completes the sentence. Monotonic attention models rely on learned policies for alignment estimation during training, making them well-suited for these scenarios. However, when applied to speech input, MMA shows suboptimal performance. Issues arise from numerical instability, bias in the alignment estimation, and large variance in the alignment, particularly in the later parts of sentences, due to the continuous nature of speech encoder states.
To address these challenges, the present paper introduced EMMA, offering a novel, numerically stable, and unbiased monotonic alignment estimation, proving effective in both simultaneous text-to-text and speech-to-text translation tasks. The model also introduced strategies for reducing monotonic alignment variance and included regularization of latency. Furthermore, the training scheme was enhanced by fine-tuning from a pre-trained offline model.
EMMA
Researchers broke the model down into three key aspects: numerically stable estimation, alignment shaping, and simultaneous fine-tuning. EMMA's monotonic alignment estimation, denoted as α, was derived for a single attention head, with the same estimation applied to every attention head in the Transformer-based MMA. The infinite-lookback variant of monotonic attention was emphasized.
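For reference, prior monotonic attention work, on which EMMA builds, defines the expected alignment recursively from a stepwise read/write probability; the notation below follows that standard formulation rather than quoting the paper's exact equations:

```latex
% Expected monotonic alignment: the probability that target step i stops reading
% and attends at source step j, given stepwise write probabilities p_{i,j}.
\alpha_{i,j} = p_{i,j} \sum_{k=1}^{j} \alpha_{i-1,k} \prod_{l=k}^{j-1} \left( 1 - p_{i,l} \right)
```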
- Numerically Stable Estimation:
EMMA addressed numerical instability in alignment estimation with a new closed-form estimator built around a transition matrix, which keeps the estimate stable and unbiased without the problematic denominator required by earlier estimators.
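As a rough illustration of the idea (a minimal NumPy sketch with invented names, not the paper's implementation), the alignment for one target step can be obtained from the previous step's alignment through a product with an upper-triangular transition matrix built only from products of p and (1 - p), so no division appears:

```python
import numpy as np

def stable_alignment_step(alpha_prev, p):
    """Propagate the monotonic alignment one target step without division.

    alpha_prev : (S,) expected alignment of the previous target step.
    p          : (S,) stepwise write probabilities for the current target step.
    Returns alpha with alpha[j] = p[j] * sum_k alpha_prev[k] * prod_{l=k..j-1} (1 - p[l]).
    """
    S = len(p)
    # Transition matrix T[k, j] = p[j] * prod_{l=k}^{j-1} (1 - p[l]) for j >= k, else 0.
    T = np.zeros((S, S))
    for k in range(S):
        keep = 1.0  # running product of (1 - p[l]) for l = k .. j-1
        for j in range(k, S):
            T[k, j] = p[j] * keep
            keep *= 1.0 - p[j]
    return alpha_prev @ T  # alpha_i = alpha_{i-1} T

# Tiny usage example: the first target step starts aligned to the first source state.
p = np.array([0.1, 0.3, 0.8, 0.9])
alpha0 = np.array([1.0, 0.0, 0.0, 0.0])
print(stable_alignment_step(alpha0, p))
```

The explicit loops are only for readability; in practice the same triangular products would be computed in a vectorized, log-space form.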
- Alignment Shaping:
Latency regularization was introduced to prevent the model from learning a trivial policy during training. Expected delays were estimated from the alignment, and a latency regularization term was added to the loss function. Additionally, an alignment variance reduction strategy was proposed, introducing an enhanced stepwise probability network and a variance loss term in the objective function.
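To make the shaping terms concrete, the sketch below (an illustrative approximation with invented names, not the paper's exact loss terms) computes expected delays as the alignment-weighted source positions and derives simple latency and variance penalties that would be added to the translation loss:

```python
import numpy as np

def alignment_shaping_terms(alpha):
    """Illustrative latency and variance penalties from an alignment matrix.

    alpha : (T, S) expected alignment; alpha[i, j] is the probability that
            target step i is written right after reading source step j.
    """
    T, S = alpha.shape
    positions = np.arange(1, S + 1)                    # source positions 1..S
    expected_delay = alpha @ positions                 # d[i] = sum_j j * alpha[i, j]
    expected_sq = alpha @ positions**2
    delay_variance = expected_sq - expected_delay**2   # per-step alignment variance
    latency_term = expected_delay.mean()               # penalize reading too far ahead
    variance_term = delay_variance.mean()              # penalize diffuse alignments
    return latency_term, variance_term

# Usage: total loss = translation loss + weighted shaping terms (weights are hypothetical).
alpha = np.array([[0.7, 0.2, 0.1, 0.0],
                  [0.1, 0.5, 0.3, 0.1],
                  [0.0, 0.1, 0.4, 0.5]])
lat, var = alignment_shaping_terms(alpha)
loss = 1.0 + 0.1 * lat + 0.1 * var  # 1.0 stands in for the translation loss
print(lat, var, loss)
```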
- Simultaneous Fine-tuning:
Simultaneous fine-tuning was introduced as a method to enhance adaptability and leverage recent advancements in large foundational translation models. This involved initializing the simultaneous model from a pre-trained offline encoder-decoder model and optimizing only the decoder and the policy network during training, on the assumption that the generative components closely resemble those of the offline model.
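In practice, this amounts to loading the offline weights, freezing the encoder, and restricting the optimizer to the decoder and policy parameters. A minimal PyTorch-style sketch follows; the module structure and attribute names are hypothetical placeholders, not the released code:

```python
import torch
import torch.nn as nn

class ToySimulModel(nn.Module):
    """Stand-in for a simultaneous translation model (hypothetical structure)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)    # initialized from an offline model in practice
        self.decoder = nn.Linear(16, 16)
        self.policy_net = nn.Linear(16, 1)  # produces stepwise read/write probabilities

model = ToySimulModel()

# Freeze the offline-initialized encoder; train only the decoder and the policy network.
for param in model.encoder.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    list(model.decoder.parameters()) + list(model.policy_net.parameters()),
    lr=1e-4,
)
```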
- Streaming Inference:
For streaming speech input, the inference pipeline used SimulEval, updating the encoder with each new speech chunk and running the decoder to generate partial text translations based on the policy. This streaming inference process ensured real-time translation for applications like simultaneous speech-to-text translation.
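The control flow can be pictured as a simple read/write loop. The sketch below is a schematic, framework-free version; the helper callables are placeholders for the model's components and this is not the SimulEval agent API:

```python
def streaming_translate(speech_chunks, encode, policy_write_prob, decode_next, threshold=0.5):
    """Schematic read/write loop for streaming speech-to-text translation.

    speech_chunks     : iterable of incoming audio chunks.
    encode            : callable returning encoder states for all audio seen so far.
    policy_write_prob : callable giving the probability of writing the next token now.
    decode_next       : callable producing the next partial-translation token.
    All callables are hypothetical placeholders for the model's components.
    """
    audio, output = [], []
    for chunk in speech_chunks:
        audio.append(chunk)                      # READ: consume a new speech chunk
        states = encode(audio)                   # update the encoder states with the new input
        while policy_write_prob(states, output) > threshold:
            token = decode_next(states, output)  # WRITE: emit a partial translation token
            if token is None:                    # decoder signals end of sentence
                return output
            output.append(token)
    return output
```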
EMMA offered numerically stable alignment estimation, introduced strategies for alignment shaping and simultaneous fine-tuning, and facilitated streaming inference for real-time applications.
Experimental Setup
The proposed models for speech-to-text translation were evaluated using the SimulEval toolkit, focusing on translation quality, measured with detokenized BLEU (BiLingual Evaluation Understudy), and latency, measured with Average Lagging (AL). The simultaneous fine-tuning strategy was followed, initializing the simultaneous model from an offline translation model. Two experimental configurations, bilingual and multilingual, were established for the speech-to-text task. The bilingual setup involved training separate models for each language direction (Spanish-English and English-Spanish), while the multilingual task demonstrated adaptation from an existing large-scale multilingual model, SeamlessM4T.
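Average Lagging measures how many source units the system trails behind an ideal translator that never waits. A minimal sketch following the commonly used definition (not the SimulEval implementation itself) is shown below:

```python
def average_lagging(delays, src_len, tgt_len):
    """Average Lagging (AL) over one sentence, per the commonly used definition.

    delays  : delays[t-1] = number of source units read before writing target unit t.
    src_len : number of source units |x|.
    tgt_len : number of target units |y|.
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: first target index whose delay reaches the full source length.
    tau = next((t + 1 for t, d in enumerate(delays) if d >= src_len), len(delays))
    lag = [delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)]
    return sum(lag) / tau

# Example: a system that reads 3 source units, then alternates write and read (AL = 3).
print(average_lagging([3, 4, 5, 6, 6, 6], src_len=6, tgt_len=6))
```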
Recent research emphasized the neural end-to-end approach for speech-to-text tasks, aiming for simplicity and efficiency. Initial attempts showed a quality decrease compared to cascade approaches, but subsequent studies improved performance with additional layers. Transformer models, successful in text translation, have been applied to speech translation, achieving quality and training speed improvements.
Simultaneous translation policies fell into three categories: predefined context-free rule-based policies, flexible policies learned with reinforcement learning, and models using monotonic attention. Monotonic attention, with its closed-form expected attention, advanced both online decoding efficiency and translation quality.
In the experimental setup, models were initialized and fine-tuned for both bilingual and multilingual scenarios, demonstrating adaptability and leveraging pre-trained models for efficient training and performance evaluation. The bilingual setup used a pre-trained wav2vec 2.0 encoder and an mBART decoder, while the multilingual setting initialized the model with the speech-to-text (S2T) part of an offline SeamlessM4T model.
Conclusion
In conclusion, the study introduced EMMA for simultaneous speech-to-text translation. EMMA combined numerically stable alignment estimation with alignment shaping and simultaneous fine-tuning, achieving state-of-the-art performance. Experimental evaluations emphasizing quality and latency demonstrated the model's efficacy in bilingual and multilingual setups. The adaptation of Transformer-based monotonic attention proved crucial for real-time, context-aware speech translation in diverse linguistic scenarios.