SONAR EXPRESSIVE: Advancing Zero-Shot Expressive Speech-to-Speech Translation

In an article recently posted to the Meta Research website, scientists proposed the SONAR EXPRESSIVE model for zero-shot expressive speech-to-speech translation (S2ST), complementing the SONAR sentence-level speech embeddings with an additional embedding of speech properties.

Study: SONAR EXPRESSIVE: Advancing Zero-Shot Expressive Speech-to-Speech Translation. Image credit: 1st footage/Shutterstock

Background

In the last few years, S2ST has made substantial progress, and end-to-end trained systems have emerged that can outperform conventional cascaded approaches. The T-Modules architecture is an alternative framework that connects trained text/speech encoders and decoders through a fixed-size multimodal and multilingual sentence embedding space.

Initially, this approach was based on the LASER sentence embedding and later extended to a space designated as SONAR. The T-Modules architecture has delivered competitive performance in zero-shot S2ST and zero-shot speech-to-text translation (S2TT).

However, all these approaches focus only on preserving the meaning of the spoken sentence, because multimodal and multilingual sentence representations such as SONAR are trained primarily to capture the meaning of the encoded speech or text. Human oral communication also conveys additional information, such as emotion, prosody, pitch, and speech rate, and these speech characteristics are crucial for correctly understanding the intention and message of the speaker.

The proposed approach

In this study, researchers proposed to extend the T-Modules architecture by introducing an additional embedding for capturing the generic speech characteristics. They trained a speech decoder model in the SONAR framework, which can decode both multilingual and multimodal SONAR sentence embeddings into expressive speech.

An English speech decoder was trained on paired speech-text data and monolingual raw speech data using a modular training strategy, learning to decode SONAR embeddings computed with pre-trained speech/text encoders. At inference time, this English speech decoder could decode spoken languages unseen during training, enabling zero-shot S2ST.
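As a rough illustration of why zero-shot decoding falls out of a shared embedding space, the sketch below wires together stand-in modules. The class names, the 1024-dimensional embedding size, and the toy unit decoder are assumptions for illustration, not the actual SONAR components or API.

```python
# Minimal sketch of zero-shot S2ST through a shared sentence embedding space.
# All modules here are illustrative stand-ins, not the actual SONAR components.
import torch
import torch.nn as nn

EMB_DIM = 1024  # assumed fixed-size sentence embedding dimension


class MultilingualSpeechEncoder(nn.Module):
    """Stand-in for a SONAR-style speech encoder: features -> sentence embedding."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, EMB_DIM)  # pretend input is 80-dim speech features

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Mean-pool over time to obtain one fixed-size vector per utterance.
        return self.proj(feats).mean(dim=1)


class EnglishSpeechDecoder(nn.Module):
    """Stand-in decoder trained only on English: embedding -> speech-unit logits."""

    def __init__(self, n_units: int = 1024, n_steps: int = 50):
        super().__init__()
        self.n_steps = n_steps
        self.out = nn.Linear(EMB_DIM, n_units)

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        # Emit a short sequence of unit logits conditioned only on the embedding.
        return torch.stack([self.out(sent_emb) for _ in range(self.n_steps)], dim=1)


encoder = MultilingualSpeechEncoder()
decoder = EnglishSpeechDecoder()

# A French utterance (random features here) is mapped into the shared space;
# the English-only decoder can still decode it because it only ever sees
# language-agnostic sentence embeddings -- this is the zero-shot S2ST path.
french_speech = torch.randn(1, 120, 80)           # (batch, time, feature)
sentence_embedding = encoder(french_speech)       # (1, EMB_DIM)
english_units = decoder(sentence_embedding).argmax(dim=-1)
print(english_units.shape)                        # (1, 50) discrete unit ids
```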

Researchers also introduced an additional embedding, disentangled from the SONAR semantic representations and designated the SPEECHPROP embedding, to encode expressivity and prosody properties of the speech modality that are not represented by the SONAR semantic embeddings. The combined system, consisting of the expressivity-aware speech decoder and SPEECHPROP, was designated SONAR EXPRESSIVE. EnCodec units were used as the target of the unit decoder for generating diverse speech, as these units are trained to form compressed audio representations. Additionally, the decoder in the EnCodec model can produce speech waveforms from the units.
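To make the combined system more concrete, here is a minimal sketch of how a unit decoder might be conditioned jointly on a semantic embedding and a SPEECHPROP-style embedding, emitting discrete unit identifiers. The module names, dimensions, fusion-by-concatenation strategy, and placeholder unit-to-waveform step are illustrative assumptions, not the released SONAR EXPRESSIVE implementation; in the real system the waveform would be produced by the EnCodec decoder.

```python
# Sketch: conditioning an expressive unit decoder on both a semantic (SONAR-style)
# embedding and a prosody/expressivity (SPEECHPROP-style) embedding.
# All names, sizes, and the concatenation-based fusion are illustrative assumptions.
import torch
import torch.nn as nn

SEM_DIM, PROP_DIM, N_UNITS = 1024, 512, 1024  # assumed sizes


class SpeechPropEncoder(nn.Module):
    """Hypothetical encoder of expressive properties (prosody, style, rate)."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, PROP_DIM)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats).mean(dim=1)


class ExpressiveUnitDecoder(nn.Module):
    """Decodes fused (semantic + expressive) conditioning into discrete unit ids."""

    def __init__(self, n_steps: int = 75):
        super().__init__()
        self.n_steps = n_steps
        self.fuse = nn.Linear(SEM_DIM + PROP_DIM, 512)
        self.out = nn.Linear(512, N_UNITS)

    def forward(self, sem: torch.Tensor, prop: torch.Tensor) -> torch.Tensor:
        cond = torch.relu(self.fuse(torch.cat([sem, prop], dim=-1)))
        return torch.stack([self.out(cond) for _ in range(self.n_steps)], dim=1)


def units_to_waveform(units: torch.Tensor) -> torch.Tensor:
    """Placeholder for the EnCodec decoder that turns units into audio samples."""
    return torch.zeros(units.shape[0], units.shape[1] * 320)  # fake 320x upsampling


source_feats = torch.randn(1, 120, 80)
semantic_emb = torch.randn(1, SEM_DIM)            # would come from a SONAR encoder
prosody_emb = SpeechPropEncoder()(source_feats)   # SPEECHPROP-style embedding
unit_ids = ExpressiveUnitDecoder()(semantic_emb, prosody_emb).argmax(dim=-1)
waveform = units_to_waveform(unit_ids)
print(unit_ids.shape, waveform.shape)
```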

Researchers performed zero-shot S2ST from Italian, French, Spanish, Chinese, or German into English; however, the proposed approach is generic and can be applied to other languages. They used the original SONAR English speech encoder and trained a single new speech encoder covering the remaining languages. A multi-stage training approach was adopted: the decoder was first trained only on unlabeled monolingual speech data, and non-expressivity-aligned speech-to-text (S2T) data was then introduced so that expressively aligned target speech could be generated in a zero-shot, cross-modal way.
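The staged recipe can be pictured as a simple schedule that switches data sources between stages. The stage names, data descriptions, and objectives below paraphrase the description above and are assumptions for illustration, not the authors' actual training configuration.

```python
# Illustrative multi-stage training schedule, paraphrasing the description above.
# Stage names and fields are assumptions for clarity, not the authors' config.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    data: str           # which corpus type is used
    objective: str      # what the decoder learns at this stage


schedule = [
    Stage("PT1", "unlabeled monolingual English speech",
          "reconstruct speech units from the utterance's own sentence embedding"),
    Stage("PT2", "non-expressivity-aligned S2T pairs plus multilingual text",
          "decode embeddings of non-English inputs into English speech"),
    Stage("FT", "expressive speech with SPEECHPROP conditioning",
          "generate expressively aligned target speech"),
]

for stage in schedule:
    print(f"{stage.name}: train on {stage.data} -> {stage.objective}")
```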

Experimental evaluation

Researchers evaluated their models on both the MEXPRESSO and FLEURS benchmark datasets. The content translation quality of the proposed SONAR EXPRESSIVE system was evaluated using ASR-BLEU, computed for each language across both benchmark datasets at each model training stage.
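ASR-BLEU scores the content of generated speech by transcribing it with an automatic speech recognition (ASR) system and computing BLEU between the transcripts and reference translations. In the sketch below, the transcribe function is a placeholder for an ASR model and the file names are hypothetical; the BLEU computation uses the sacrebleu package.

```python
# Sketch of ASR-BLEU: transcribe generated speech, then score the transcripts
# against reference translations with BLEU. The ASR step is a placeholder.
import sacrebleu


def transcribe(audio_path: str) -> str:
    """Placeholder for an ASR system (e.g. a pretrained English recognizer)."""
    return "this is a placeholder transcript"


generated_audio = ["utt1.wav", "utt2.wav"]                    # system outputs
references = ["this is the reference", "another reference"]   # reference translations

hypotheses = [transcribe(path) for path in generated_audio]
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"ASR-BLEU: {bleu.score:.2f}")
```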

The speech decoder was conditioned on different semantic embeddings to analyze cross-modal and cross-lingual transfer. Three embeddings were used: one extracted from the non-English source transcription, one from the non-English source speech, and one from the target English text.

These three setups correspond to text-to-speech translation (T2ST), zero-shot S2ST, and text-to-speech synthesis (TTS), respectively. Additionally, the prosodic qualities of the translation system were assessed using several expressivity metrics.
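The three setups differ only in where the conditioning embedding comes from, as the minimal sketch below illustrates; the encoder functions are hypothetical placeholders for SONAR-style text and speech encoders, and the embedding size and input strings are assumptions.

```python
# Sketch of the three evaluation setups: the same English speech decoder is
# conditioned on embeddings from different sources. Encoder functions are
# illustrative placeholders for SONAR-style text/speech encoders.
import torch


def encode_text(text: str) -> torch.Tensor:
    """Placeholder for a SONAR-style text encoder."""
    return torch.randn(1, 1024)


def encode_speech(audio_path: str) -> torch.Tensor:
    """Placeholder for a SONAR-style speech encoder."""
    return torch.randn(1, 1024)


setups = {
    "T2ST": encode_text("phrase source en français"),    # non-English source text
    "S2ST": encode_speech("source_french.wav"),          # non-English source speech
    "TTS":  encode_text("the target English sentence"),  # target English text
}

for name, embedding in setups.items():
    # decoder(embedding) would generate English speech in each setup
    print(name, embedding.shape)
```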

A pre-trained WavLM-based speaker style encoder was used to extract speaker style embeddings of the target and source speech, and speaker style similarity was then measured as the cosine similarity between the target and source embeddings to determine which expressivity dimensions the SPEECHPROP embedding captures. Rhythmic patterns were assessed by comparing pause alignment and speech rate.
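A minimal sketch of these expressivity metrics is shown below: cosine similarity between speaker style embeddings, and a Spearman rank correlation between source and target speech rates. The style_embedding function, file names, and example rate values are placeholders, not outputs of the actual WavLM-based encoder or the evaluation data.

```python
# Sketch of the expressivity metrics: speaker-style cosine similarity and a
# Spearman correlation over per-utterance speech rates. The style encoder is a
# placeholder for the pre-trained WavLM-based model mentioned above.
import numpy as np
from scipy.stats import spearmanr


def style_embedding(audio_path: str) -> np.ndarray:
    """Placeholder for the WavLM-based speaker style encoder."""
    rng = np.random.default_rng(abs(hash(audio_path)) % (2**32))
    return rng.normal(size=256)


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Speaker style similarity between one source utterance and its translation.
src_emb = style_embedding("source_french.wav")
tgt_emb = style_embedding("generated_english.wav")
print("speaker style similarity:", cosine(src_emb, tgt_emb))

# Rhythm: rank correlation between source and target speech rates (e.g.
# syllables per second, measured per utterance) across a test set.
source_rates = np.array([4.1, 5.3, 3.8, 6.0, 4.7])
target_rates = np.array([4.0, 5.5, 3.6, 5.8, 4.9])
rho, _ = spearmanr(source_rates, target_rates)
print("speech-rate Spearman correlation:", rho)
```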

Significance of the study

SONAR EXPRESSIVE performed TTS effectively, with performance on FLEURS improving significantly after the second stage of pre-training (PT2); no such gain was observed after PT2 on MEXPRESSO. However, introducing SPEECHPROP embeddings during fine-tuning resulted in some loss in ASR-BLEU. PT2 also improved performance on T2ST by increasing the robustness of the speech decoder to other languages.

Similarly, the zero-shot S2ST ASR-BLEU results improved significantly after PT2, when multilingual text inputs were added to the training, confirming that multilingual inputs can increase the robustness of the speech decoder. Models not trained with SPEECHPROP embeddings generated output speech with low speaker style similarity to the input speech.

However, using SPEECHPROP embeddings during the fine-tuning stages significantly increased the speaker style similarity between the generated target speech and the source speech across all languages. Additionally, the pause alignment and speech-rate Spearman correlation results showed that introducing the SPEECHPROP embedding led to large gains on both metrics.


Written by

Samudrapom Dam

Samudrapom Dam is a freelance scientific and business writer based in Kolkata, India. He has been writing articles related to business and scientific topics for more than one and a half years. He has extensive experience in writing about advanced technologies, information technology, machinery, metals and metal products, clean technologies, finance and banking, automotive, household products, and the aerospace industry. He is passionate about the latest developments in advanced technologies, the ways these developments can be implemented in a real-world situation, and how these developments can positively impact common people.
