In an article recently posted to the Meta Research website, scientists proposed the SONAR EXPRESSIVE model for zero-shot expressive speech-to-speech translation (S2ST), which complements SONAR sentence-level speech embeddings with speech-property embeddings.
Background
In the last few years, S2ST has made substantial progress, and end-to-end trained systems have emerged that can outperform conventional cascaded approaches. The T-Modules architecture is an alternative framework that connects trained text/speech encoders and decoders through a fixed-size multimodal and multilingual sentence embedding space.
Initially, this approach was based on the LASER sentence embedding and later extended to a space designated as SONAR. The T-Modules architecture has delivered competitive performance in zero-shot S2ST and zero-shot speech-to-text translation (S2TT).
However, all of these approaches focus only on preserving the meaning of the spoken sentence, because multimodal and multilingual sentence representations such as SONAR are trained primarily to capture the meaning of the encoded speech/text. Human oral communication also conveys additional information, such as emotion, prosody, pitch, and speech rate. These speech characteristics are crucial for correctly understanding the intention and message of oral communication.
The proposed approach
In this study, researchers proposed to extend the T-Modules architecture by introducing an additional embedding that captures generic speech characteristics. They trained a speech decoder model in the SONAR framework that can decode multilingual and multimodal SONAR sentence embeddings into expressive speech.
An English speech decoder was trained on paired speech-text data and monolingual raw speech data using a modular training strategy, learning to decode SONAR embeddings computed with pre-trained speech/text encoders. At inference time, this English speech decoder could decode speech in languages unseen during training, enabling zero-shot S2ST.
Researchers also introduced an additional embedding, disentangled from the SONAR semantic representations and designated the SPEECHPROP embedding, to encode expressivity and prosody properties of the speech modality that are not captured by the SONAR semantic embeddings. The combined system consisting of the expressivity-aware speech decoder and SPEECHPROP was designated SONAR EXPRESSIVE. EnCodec units were used as the target of the unit decoder for generating diverse speech, as these units are trained to build compressed audio representations. Additionally, the decoder of the EnCodec model can produce speech waveforms from the units.
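The article does not provide implementation details, but conceptually the expressive decoder is conditioned jointly on a sentence-level semantic embedding and a speech-property embedding and predicts discrete EnCodec units, which the EnCodec decoder can then turn into a waveform. Below is a minimal, hypothetical PyTorch sketch of that conditioning step; the module structure, dimensions, and unit vocabulary size are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ToyExpressiveUnitDecoder(nn.Module):
    """Toy sketch: predict EnCodec-style discrete units from a SONAR-like
    semantic embedding concatenated with a speech-property embedding."""

    def __init__(self, sem_dim=1024, prop_dim=512, hidden=1024, n_units=1024, max_len=200):
        super().__init__()
        self.proj = nn.Linear(sem_dim + prop_dim, hidden)    # fuse the two embeddings
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for the real unit decoder
        self.head = nn.Linear(hidden, n_units)                # logits over the discrete unit vocabulary
        self.max_len = max_len

    def forward(self, sem_emb, prop_emb):
        # sem_emb: (batch, sem_dim); prop_emb: (batch, prop_dim)
        cond = self.proj(torch.cat([sem_emb, prop_emb], dim=-1))  # (batch, hidden)
        steps = cond.unsqueeze(1).repeat(1, self.max_len, 1)      # repeat as decoder input
        out, _ = self.rnn(steps)
        return self.head(out)                                      # (batch, max_len, n_units)

decoder = ToyExpressiveUnitDecoder()
logits = decoder(torch.randn(2, 1024), torch.randn(2, 512))
unit_ids = logits.argmax(dim=-1)  # discrete units an EnCodec decoder could turn back into audio
print(unit_ids.shape)             # torch.Size([2, 200])
```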
Researchers performed zero-shot S2ST from Italian, French, Spanish, Chinese, or German into English; however, the proposed approach is generic and can be applied to other languages. They used the original SONAR English speech encoder and trained a single new speech encoder for the remaining languages. A multi-stage training approach was adopted: the model was initially trained only on unlabeled monolingual speech data, and non-expressivity-aligned S2T data was then introduced so that expressively aligned target speech could be generated in a zero-shot, cross-modal way.
Experimental evaluation
Researchers evaluated their models on both the MEXPRESSO and FLEURS benchmark datasets. The content translation quality of the proposed SONAR EXPRESSIVE system was evaluated using ASR-BLEU, with scores computed for each language pair across both benchmark datasets at each model training stage.
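ASR-BLEU scores the content of generated speech by first transcribing it with an off-the-shelf ASR model and then computing BLEU against reference translations. A minimal sketch of that metric follows, assuming the openai-whisper and sacrebleu packages; the authors' exact ASR model and text normalization may differ.

```python
# Hypothetical ASR-BLEU sketch: transcribe the generated English speech with an
# off-the-shelf ASR model, then score the transcripts against reference
# translations with corpus-level BLEU.
import whisper
import sacrebleu

asr = whisper.load_model("base.en")

def asr_bleu(wav_paths, references):
    hypotheses = [asr.transcribe(path)["text"].strip() for path in wav_paths]
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

# Example (placeholder file names and references):
# score = asr_bleu(["out_0.wav", "out_1.wav"], ["Reference one.", "Reference two."])
```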
The speech decoder was conditioned on different semantic embeddings to analyze cross-modal and cross-lingual transfer. Three embeddings were used: one extracted from the non-English source transcription, one from the non-English source speech, and one from the target English text.
These three setups correspond to text-to-speech translation (T2ST), zero-shot S2ST, and TTS, respectively (a toy sketch of these conditioning choices follows below). Additionally, the prosodic qualities of the translation system were assessed using different expressivity metrics.
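For illustration, the three conditioning setups could be selected as in the toy sketch below, where encode_text and encode_speech are stand-ins for the pre-trained SONAR text and speech encoders; all names and shapes here are assumptions for exposition only.

```python
import numpy as np

# Toy stubs standing in for the pre-trained SONAR text/speech encoders.
def encode_text(text, lang):
    return np.ones(1024)

def encode_speech(wav_path, lang):
    return np.ones(1024)

def condition_embedding(setup, sample):
    """Pick the semantic embedding the speech decoder is conditioned on."""
    if setup == "t2st":            # non-English source transcription -> English speech
        return encode_text(sample["source_text"], sample["source_lang"])
    if setup == "zero_shot_s2st":  # non-English source speech -> English speech
        return encode_speech(sample["source_wav"], sample["source_lang"])
    if setup == "tts":             # target English text -> English speech
        return encode_text(sample["target_text"], "eng")
    raise ValueError(f"unknown setup: {setup}")
```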
A pre-trained WavLM-based speaker style encoder was used to extract speaker style embeddings of the target and source speech, and speaker style similarity was then measured as the cosine similarity between the target and source embeddings to determine the expressivity dimensions captured by the SPEECHPROP embedding. Rhythmic patterns were captured by comparing both pause alignment and speech rate between source and target speech.
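As described, the speaker style similarity reduces to a cosine similarity between two style embeddings. A minimal sketch follows; the WavLM-based style encoder is assumed to be available as a function style_embedding and is not shown here.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical usage, assuming style_embedding() wraps the WavLM-based encoder:
# similarity = cosine_similarity(style_embedding(source_wav), style_embedding(target_wav))
```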
Significance of the study
SONAR EXPRESSIVE performed TTS effectively, with performance improving significantly after the second stage of pre-training (PT2) on FLEURS; no such gain was observed after PT2 on MEXPRESSO. However, introducing SPEECHPROP embeddings during fine-tuning resulted in some loss in ASR-BLEU. PT2 also improved system performance on T2ST by making the speech decoder more robust to other languages.
Similarly, the zero-shot S2ST ASR-BLEU results improved significantly after PT2, when multilingual text inputs were added to the training, which confirmed that multilingual inputs can increase the robustness of the speech decoder. Models trained without SPEECHPROP embeddings generated output speech with low speaker style similarity to the input speech.
However, using SPEECHPROP embeddings during the fine-tuning stages significantly increased the speaker style similarity between the generated target speech and the source speech across all languages. Additionally, the pause alignment and speech rate Spearman correlation results demonstrated that introducing the SPEECHPROP embedding led to large gains in both metrics.
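For reference, the speech-rate correlation reported here is a Spearman rank correlation between per-utterance speech rates of the source and the generated target speech. A small sketch with made-up placeholder numbers is shown below.

```python
# Illustrative computation of the speech-rate Spearman correlation: per-utterance
# speech rates (e.g., syllables per second) for source and generated target speech
# are correlated across a test set. The values below are placeholder examples.
from scipy.stats import spearmanr

source_rates = [3.8, 4.5, 2.9, 5.1, 4.0]   # hypothetical source speech rates
target_rates = [3.6, 4.7, 3.1, 4.9, 4.2]   # hypothetical generated-speech rates

rho, p_value = spearmanr(source_rates, target_rates)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```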