Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

Researchers from Meta AI recently introduced EXPRESSO, a sizeable, high-quality dataset of expressive speech, along with a benchmark for discrete textless speech resynthesis that preserves expressive styles. The work, submitted to Interspeech, supports building systems that go beyond neutral read speech and capture diverse vocal expressions such as emotions, accents, emphasis, and non-verbal sounds.

Study: EXPRESSO: A Breakthrough Dataset and Benchmark for Expressive Speech Synthesis. Image credit: metamorworks/Shutterstock.

Recent progress in self-supervised speech models has made it possible to learn discrete representations of speech without text annotations. Because these models can be trained on more varied data than read speech alone, they open up possibilities for expressive speech synthesis that is not limited by what text can convey. However, most existing datasets lack the diversity, quality, and spontaneous expressiveness needed to explore this potential fully.

To address this limitation, the researchers introduced EXPRESSO, comprising improvised dialogues and parallel expressive readings across 26 recognizable vocal styles. They also propose an expressive resynthesis benchmark using this data, where systems must encode speech into discrete units and resynthesize it, preserving content, speaker identity, and expressive style.

The EXPRESSO Dataset

EXPRESSO contains 47 hours of speech from 4 speakers of North American English. Of that, 37% consists of short prompt readings in 7 core styles, applied in parallel, along with emphasis and long-form narrative material. The remaining 63% comprises improvised conversations eliciting 25 expressive styles through situational prompts.
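To make the split concrete, the percentages above can be converted into approximate hours. This is only a back-of-the-envelope sketch using the figures quoted in this article; the helper name `split_hours` and the rounding to one decimal place are illustrative choices, not part of the study.

```python
# Figures as quoted in the article; rounding to one decimal is our choice.
TOTAL_HOURS = 47.0
SHARES = {"read": 0.37, "improvised": 0.63}


def split_hours(total_hours, shares):
    """Return the approximate hours implied by each fractional share."""
    return {name: round(total_hours * share, 1) for name, share in shares.items()}


hours = split_hours(TOTAL_HOURS, SHARES)
for name, value in hours.items():
    print(f"{name}: about {value} hours")
```

Under these numbers, roughly 17 hours are read material and roughly 30 hours are improvised dialogue.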

The styles capture universal emotions, speech mannerisms, accents, and non-verbals. The improvisations introduce realism lacking in most acted speech. Data was professionally recorded in a studio environment. The corpus strikes a balance between content diversity and phonetic coverage.

The authors formulate an expressive resynthesis task using EXPRESSO to demonstrate its capabilities. The input audio containing particular content and style must be encoded into discrete units and resynthesized in a target voice while preserving linguistic information and expressivity.

The task poses several challenges: modeling the speech, quantizing it into units, decoding those units, and conditioning the output on speaker and style. The discrete bottleneck forces all necessary detail to be compressed into the units. Metrics evaluate content, pitch, and style preservation using an Automatic Speech Recognition (ASR) model, F0 analysis, and an expressive style classifier, respectively.
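Content preservation with an ASR model is commonly scored as word error rate (WER) between the original transcript and the ASR transcript of the resynthesized audio. The sketch below is a generic WER implementation, assuming the two transcripts are already available as strings; it is not the benchmark's own scoring code, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A lower WER on the resynthesized audio indicates that more of the linguistic content survived the discrete bottleneck.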

Models Investigated

The study compares several speech encoders, such as HuBERT and Encodec, trained on either read speech or more diverse data. The discrete units are obtained either by applying k-means clustering to the hidden states of a self-supervised learning (SSL) model or by taking Encodec's discrete codes directly. A HiFi-GAN vocoder (a generative adversarial network, or GAN) generates the final speech, conditioned on speaker ID and, where applicable, style ID.
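The k-means step can be pictured as a nearest-centroid lookup: each frame of encoder hidden states is replaced by the index of its closest cluster center. The following is a minimal pure-Python sketch with toy two-dimensional features and hand-picked centroids; in practice the centroids would be fitted with k-means on real encoder features, and the names `nearest_centroid` and `quantize` are hypothetical.

```python
def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to the frame (squared L2)."""
    best_index, best_distance = 0, float("inf")
    for index, centroid in enumerate(centroids):
        distance = sum((x - c) ** 2 for x, c in zip(frame, centroid))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def quantize(frames, centroids):
    """Map a sequence of feature frames to a sequence of discrete unit IDs."""
    return [nearest_centroid(frame, centroids) for frame in frames]


# Toy data: three centroids and four frames (assumed values, for illustration).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [3.8, 0.2], [0.2, 0.1]]
units = quantize(frames, centroids)
print(units)
```

The resulting list of integers is the discrete unit sequence that a vocoder would then turn back into audio.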

As a generic audio compression model, Encodec provides high-bitrate baselines. HuBERT produces lower-bitrate phonetic units and is not trained with a compression objective. To analyze the tradeoffs, the authors ablate conditioning strategies, model architectures, tokenizer training data, and other factors.
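The bitrate gap between the two families follows from simple arithmetic: frames per second, times bits per code, times the number of parallel codebooks. The sketch below uses illustrative settings that are assumptions rather than figures from this article (a single 2000-entry codebook at 50 frames per second for a HuBERT-style tokenizer, and eight 1024-entry codebooks at 75 frames per second for an Encodec-style codec); it gives an upper bound that assumes every code is equally likely.

```python
import math


def bitrate_bps(frames_per_second, codebook_size, num_codebooks=1):
    """Upper-bound bitrate in bits per second for a discrete tokenizer."""
    bits_per_code = math.log2(codebook_size)
    return frames_per_second * bits_per_code * num_codebooks


# Assumed, illustrative configurations (not taken from the study).
hubert_like = bitrate_bps(frames_per_second=50, codebook_size=2000)
encodec_like = bitrate_bps(frames_per_second=75, codebook_size=1024, num_codebooks=8)

print(f"HuBERT-style units:  {hubert_like:.0f} bits/s")
print(f"Encodec-style codes: {encodec_like:.0f} bits/s")
```

Under these assumptions the codec-style stream carries roughly ten times more bits per second, which leaves more room for pitch and other prosodic detail.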

Encodec systems significantly outperformed HuBERT-based ones in pitch preservation while trailing slightly in content accuracy. HuBERT units learned from noisier data proved more robust than those trained on read speech. Clustering on EXPRESSO gave better phonetic quality than clustering on upstream SSL data.

Style modeling benefited greatly from explicitly conditioning the decoder on style, although some style information was also observed to leak through the units themselves. Training the tokenizer on EXPRESSO rather than on upstream SSL data brought further improvements. Out-of-domain generalization of style remains a challenge.
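One simple way to picture explicit decoder conditioning is to attach a speaker vector, and optionally a style vector, to every unit embedding before it reaches the vocoder. The article does not say how the conditioning is injected, so the concatenation below is an assumption made for illustration, and `build_decoder_inputs` and the toy embedding tables are hypothetical.

```python
def build_decoder_inputs(units, unit_table, speaker_vec, style_vec=None):
    """Concatenate unit, speaker, and optional style vectors for each frame."""
    style_part = list(style_vec) if style_vec is not None else []
    return [list(unit_table[unit]) + list(speaker_vec) + style_part for unit in units]


# Toy embedding tables (assumed values, for illustration only).
unit_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
speaker_vec = [0.5]
style_vec = [0.2, 0.8]

with_style = build_decoder_inputs([0, 1], unit_table, speaker_vec, style_vec)
without_style = build_decoder_inputs([0, 1], unit_table, speaker_vec)
print(with_style)
print(without_style)
```

When the style vector is omitted, the decoder can only recover style from whatever information leaks through the units, which is the contrast the ablation above examines.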

Discussion

This research provides critical infrastructure to spur progress in expressive speech synthesis. The EXPRESSO dataset captures diverse styles through a combination of natural improvisations and parallel expressive readings. The resynthesis benchmark facilitates model development and analysis using this data.

Key findings include the advantage of compression-based units over phonetic units for prosody modeling, and the benefit of clustering on target-domain data. The results also highlight remaining challenges such as out-of-domain style transfer. The public release of datasets, models, and metrics will support further work in this field.

Building capable expressive speech systems can enhance naturalness and human-computer interaction across applications like virtual assistants, audiobooks, and accessibility tools. However, ethical considerations regarding data collection and appropriate system use will be critical as the technology evolves. EXPRESSO is an essential early step towards inclusive and empowering speech technologies.

Future Outlook

Using EXPRESSO, the authors formulate an expressive resynthesis benchmark to demonstrate model capabilities. Experiments with self-supervised speech encoders reveal tradeoffs between compression-based and phonetic units, as well as the effects of decoder conditioning strategies.

While challenges remain in effectively transferring expressivity, especially across domains, EXPRESSO provides a strong impetus for progress. Its professional studio-quality recordings capture the spontaneity lacking in earlier acted datasets. The public release of datasets, models, and metrics will support the community in developing more nuanced speech systems. Overall, this work helps lay the groundwork for technologies that understand and generate the full range of human vocal expression.

Written by

Aryaman Pattnayak

Aryaman Pattnayak is a tech writer based in Bhubaneswar, India. His academic background is in Computer Science and Engineering. Aryaman is passionate about leveraging technology for innovation and has a keen interest in Artificial Intelligence, Machine Learning, and Data Science.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Pattnayak, Aryaman. (2023, September 24). Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis. AZoAi. Retrieved on January 15, 2025 from https://www.azoai.com/news/20230924/Expresso-A-Benchmark-and-Analysis-of-Discrete-Expressive-Speech-Resynthesis.aspx.

  • MLA

    Pattnayak, Aryaman. "Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis". AZoAi. 15 January 2025. <https://www.azoai.com/news/20230924/Expresso-A-Benchmark-and-Analysis-of-Discrete-Expressive-Speech-Resynthesis.aspx>.

  • Chicago

    Pattnayak, Aryaman. "Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis". AZoAi. https://www.azoai.com/news/20230924/Expresso-A-Benchmark-and-Analysis-of-Discrete-Expressive-Speech-Resynthesis.aspx. (accessed January 15, 2025).

  • Harvard

    Pattnayak, Aryaman. 2023. Expresso: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis. AZoAi, viewed 15 January 2025, https://www.azoai.com/news/20230924/Expresso-A-Benchmark-and-Analysis-of-Discrete-Expressive-Speech-Resynthesis.aspx.
