Researchers from Meta AI recently proposed SeamlessM4T, a single model supporting speech-to-speech translation (S2ST), speech-to-text translation (S2TT), text-to-speech translation (T2ST), text-to-text translation (T2TT), and automatic speech recognition (ASR) for up to 100 languages. Leveraging vast amounts of audio data and self-supervised speech representations, SeamlessM4T outperforms prior models.
Background
Creating the Babel Fish, a universal speech translation tool, remains a challenging endeavor for scientists. While text-based models have expanded translation capabilities, unified speech-to-speech models lag behind. Conventional systems rely on chains of multiple subsystems, which hinders scalability. To bridge this gap, researchers developed SeamlessM4T, which achieves remarkable results in speech-to-text translation (S2TT), improves translation quality and safety, and has been open-sourced to support further advancement.
Prioritizing speech in machine translation
Machine translation (MT) has primarily focused on text because of its abundance and ease of handling. However, speech is distinct, with its own grammar, registers, and expressive qualities, and it fosters stronger social bonds than text-based communication. Current speech translation models have limitations: cascaded systems chain together various subsystems, while direct speech-to-text translation models suffer from limited language coverage. AudioPaLM stands as the current state of the art, bridging the gap between text and speech translation.
The current study aims to create a unified large model capable of handling speech and text translation tasks, expand language coverage, and ensure systematic evaluations for safe and equitable performance. It seeks to bridge the translation gap between high- and low-resource languages, making translation technology accessible to all.
SeamlessAlign: Automatically creating aligned data for speech
Creating an effective multilingual and multimodal translation system such as SeamlessM4T demands substantial resources spanning multiple languages and modalities. While some human-annotated translation resources are freely accessible, they often cover a limited set of languages or specific domains. Collections involving the speech modality also exist, such as CoVoST, a diverse multilingual S2TT corpus, and Multilingual TEDx. However, no open dataset currently matches the scale of the closed data behind initiatives such as Whisper, which have shown exceptional performance.
To address this, parallel data mining emerges as an alternative to relying on closed data, offering broader language coverage and larger corpus sizes. The prevailing approach encodes sentences from various languages and modalities into a shared fixed-size embedding space (SONAR), identifies parallel instances based on a similarity metric, and performs mining through pairwise comparisons over extensive monolingual corpora.
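As a rough illustration of this mining step, the sketch below scores candidate source/target pairs with a ratio-margin criterion over cosine similarities between fixed-size embeddings. The random embeddings, the neighborhood size, and the 1.06 threshold are placeholder assumptions for illustration, not values taken from the SeamlessM4T pipeline.

```python
# Minimal sketch of margin-based parallel data mining over fixed-size
# sentence embeddings. The embeddings here are random placeholders; in
# practice they would come from a multilingual encoder such as SONAR.
import numpy as np

def margin_scores(src_emb, tgt_emb, k=4):
    """Score every source/target pair with a ratio-margin criterion:
    cosine similarity divided by the mean similarity to the k nearest
    neighbours on each side."""
    # Normalise so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                                      # (n_src, n_tgt)

    # Average similarity to the k nearest neighbours in both directions.
    knn_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)    # per source row
    knn_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)    # per target column
    denom = (knn_src[:, None] + knn_tgt[None, :]) / 2
    return sim / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    src_emb = rng.normal(size=(100, 512))   # e.g. speech-side embeddings
    tgt_emb = rng.normal(size=(120, 512))   # e.g. text-side embeddings
    scores = margin_scores(src_emb, tgt_emb)
    # Keep the best target per source if its margin score clears a threshold.
    best = scores.argmax(axis=1)
    keep = scores.max(axis=1) > 1.06        # threshold is a tunable assumption
    pairs = [(i, int(best[i])) for i in range(len(src_emb)) if keep[i]]
    print(f"mined {len(pairs)} candidate pairs")
```

At real corpus scale, such comparisons are run with approximate nearest-neighbor indexes rather than the dense similarity matrix used in this toy example.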
This method, initially introduced with the multilingual LASER embedding space, has been scaled to 200 languages and to the speech modality through teacher-student training. The resulting dataset, SeamlessAlign, was generated with this mining technique and is the most extensive open dataset for multimodal translation to date, totaling 470,000 hours. Researchers introduced several enhancements, including an improved speech language identification (LID) model, increased language coverage, and a substantial increase in raw audio data.
SeamlessM4T models
Direct S2TT models have advanced significantly in recent years, achieving parity with cascaded models in specific scenarios such as constrained data, in-domain settings, and particular language pairs. However, the landscape has shifted with the emergence of massively multilingual translation models and weakly supervised automatic speech recognition (ASR) models. This shift renders previous comparisons obsolete and highlights how far direct models still lag behind strong cascaded models.
SeamlessM4T aims to close the gap between direct and cascaded models for translating speech into text across many languages and formats. It does so by constructing a strong direct text-and-speech-into-text (X2T) model capable of translating both speech and text into text. This model combines a robust speech representation learning model with a massively multilingual T2TT model.
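To make the X2T idea concrete, the following PyTorch sketch shows one plausible shape for such a model: a speech encoder and a text encoder that both feed a shared text decoder. The layer sizes, the 80-dimensional filterbank input, and the module layout are illustrative assumptions, not the released SeamlessM4T architecture.

```python
# Illustrative sketch of an X2T model: a speech encoder and a text encoder
# feeding one shared text decoder, so speech and text inputs are translated
# into text by the same decoder. All sizes are placeholder assumptions.
import torch
import torch.nn as nn

class X2TSketch(nn.Module):
    def __init__(self, vocab=32_000, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        # Speech side: assume pre-extracted 80-dim filterbank features; a real
        # system would use a self-supervised encoder such as w2v-BERT 2.0 here.
        self.speech_proj = nn.Linear(80, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.speech_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Text side: embeddings plus an encoder, standing in for the encoder
        # of a massively multilingual T2TT model.
        self.text_emb = nn.Embedding(vocab, d_model)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Shared decoder producing target-language text for either modality.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, tgt_tokens, speech_feats=None, src_tokens=None):
        # Causal and padding masks are omitted for brevity.
        if speech_feats is not None:                        # S2TT path
            memory = self.speech_encoder(self.speech_proj(speech_feats))
        else:                                               # T2TT path
            memory = self.text_encoder(self.text_emb(src_tokens))
        hidden = self.decoder(self.text_emb(tgt_tokens), memory)
        return self.out(hidden)

if __name__ == "__main__":
    model = X2TSketch()
    speech = torch.randn(2, 120, 80)             # batch of filterbank frames
    tgt = torch.randint(0, 32_000, (2, 7))       # partial target token batch
    print(model(tgt, speech_feats=speech).shape) # torch.Size([2, 7, 32000])
```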
Additionally, SeamlessM4T tackles speech-to-speech translation (S2ST) with UnitY, a two-pass model that first generates target text and then predicts discrete acoustic units. Unlike cascaded systems built from separately trained components, the components of UnitY can be optimized jointly, reducing error propagation and mismatches between them. An intermediate semantic representation helps the model handle different source and target modalities. The vocoders used to synthesize speech from the predicted units are trained separately.
The SeamlessM4T model consists of four core building blocks: a massively multilingual T2TT model (SeamlessM4T-NLLB), a speech representation learning model utilizing unlabeled speech audio data (w2v-BERT 2.0), a text-to-unit sequence-to-sequence model (T2U), and a multilingual HiFi-GAN unit vocoder for speech synthesis from units.
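The snippet below sketches how these four blocks could be composed at inference time in the two-pass flow described above: the speech encoder and text decoder produce first-pass text, the T2U model predicts discrete units, and the vocoder turns the units into a waveform. The class and method names are placeholders for illustration, not the actual SeamlessM4T API.

```python
# Schematic composition of the four building blocks in a two-pass
# (UnitY-style) speech-to-speech flow. All attribute and method names are
# placeholders, not the real SeamlessM4T interfaces.
from dataclasses import dataclass

@dataclass
class SeamlessPipelineSketch:
    speech_encoder: object   # w2v-BERT 2.0-style self-supervised speech encoder
    text_decoder: object     # decoder initialised from a multilingual T2TT model
    t2u_model: object        # text-to-unit sequence-to-sequence model
    unit_vocoder: object     # multilingual HiFi-GAN unit vocoder

    def speech_to_speech(self, audio, tgt_lang):
        # First pass: encode the source speech and decode target-language text.
        speech_states = self.speech_encoder.encode(audio)
        text, text_states = self.text_decoder.decode(speech_states, tgt_lang)
        # Second pass: predict discrete acoustic units from the first-pass
        # decoder states, then synthesise a waveform with the unit vocoder.
        units = self.t2u_model.predict_units(text_states, tgt_lang)
        waveform = self.unit_vocoder.synthesize(units, tgt_lang)
        return text, waveform
```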
To achieve its objectives, SeamlessM4T employs multitask UnitY models that integrate components from the first three building blocks. These models are fine-tuned in three stages, starting with an X2T model targeting English and culminating in a versatile multitask UnitY system capable of T2TT, S2TT, S2ST, and ASR.
The model description covers the stages of building SeamlessM4T, including self-supervised pre-training with w2v-BERT 2.0, fine-tuning for text generation, preparing data for speech-to-speech translation, training the text-to-unit model, and the final multitask fine-tuning stage. SeamlessM4T's performance is evaluated with standard automatic metrics and compared against state-of-the-art speech translation models, showcasing its strengths across translation tasks.
To evaluate the model, researchers deployed BLASER 2.0, a versatile metric that accommodates both speech and text. Human assessments centered on preservation of speaker intent and audio quality. The model showed superior robustness to background noise and speaker variation, as evidenced by BLEU-versus-SNR and WER-versus-SNR curves.
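One way such a robustness curve can be produced is sketched below: noise is mixed into the test audio at a target signal-to-noise ratio, the noisy audio is translated, and hypotheses are scored with corpus BLEU at each SNR level. The translation function and noise source are placeholders; only sacrebleu's corpus_bleu call reflects a real library API, and the protocol is an assumption about how such curves are typically computed rather than the paper's exact setup.

```python
# Minimal sketch of a BLEU-vs-SNR robustness curve: mix noise into the test
# audio at a target signal-to-noise ratio, translate, and score with BLEU.
# `translate_to_text` is a placeholder for whatever S2TT model is evaluated.
import numpy as np
import sacrebleu

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`."""
    noise = np.resize(noise, speech.shape)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

def bleu_snr_curve(test_set, noise, translate_to_text, snrs=(20, 10, 5, 0)):
    curve = {}
    for snr in snrs:
        hyps, refs = [], []
        for speech, reference in test_set:     # (waveform, reference text) pairs
            noisy = mix_at_snr(speech, noise, snr)
            hyps.append(translate_to_text(noisy))
            refs.append(reference)
        curve[snr] = sacrebleu.corpus_bleu(hyps, [refs]).score
    return curve
```

A WER-versus-SNR curve follows the same pattern, swapping the BLEU scorer for a word error rate computation against reference transcripts.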
Responsible AI
In line with responsible system development, the researchers evaluate added toxicity and bias, which is crucial for safe deployment; translation outputs should be fair and free of bias. Toxicity analysis uses a new metric, ASR-ETOX, which transcribes the generated speech with ASR and checks it against toxicity word lists, while gender bias is assessed through masculine and feminine references. Results vary across languages and datasets. The study also investigates gender representation in the datasets, finding an overrepresentation of masculine terms. This approach has limitations, however, including its reliance on word lists and linguistic gender cues for bias detection.
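The word-list idea behind these evaluations can be illustrated with the toy sketch below, which flags list words that appear in a translation but not in its source and counts masculine and feminine terms. The lists shown are tiny English placeholders, not the curated multilingual lists used in the actual study.

```python
# Toy illustration of word-list-based checks: the idea behind ETOX-style
# added-toxicity detection and masculine/feminine term counts. The lists are
# small placeholders, not the curated multilingual lists used in the paper.
import re

TOXICITY_LIST = {"insult_word", "slur_word"}             # placeholder entries
MASCULINE_TERMS = {"he", "him", "his", "man", "men"}
FEMININE_TERMS = {"she", "her", "hers", "woman", "women"}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def added_toxicity(source, translation):
    """Return list words present in the translation but absent from the source."""
    src, hyp = set(tokenize(source)), set(tokenize(translation))
    return (hyp & TOXICITY_LIST) - src

def gender_term_counts(texts):
    """Count masculine and feminine term occurrences across a corpus."""
    masc = fem = 0
    for text in texts:
        tokens = tokenize(text)
        masc += sum(t in MASCULINE_TERMS for t in tokens)
        fem += sum(t in FEMININE_TERMS for t in tokens)
    return {"masculine": masc, "feminine": fem}
```

For speech output, an ASR system would first transcribe the generated audio before applying checks of this kind, which is the role ASR plays in the ASR-ETOX metric.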
Conclusion
In summary, SeamlessM4T addresses the limitations of existing speech translation systems, supporting ASR, T2TT, S2TT, T2ST, and S2ST for multiple languages. Developed using extensive audio data and self-supervised speech representations, it outperforms previous models across these translation tasks and has been open-sourced to enable further advancement. Additionally, SeamlessM4T demonstrates reduced added toxicity and improved robustness, marking significant progress in responsible AI.