Researchers from Meta AI recently introduced EXPRESSO, a sizeable, high-quality dataset of expressive speech, along with a benchmark for discrete textless speech resynthesis that preserves expressive style. Submitted to Interspeech, the work supports building systems that go beyond neutral read speech to capture diverse vocal expression: emotions, accents, emphasis, and non-verbal sounds.
Recent progress in self-supervised speech models has made it possible to learn discrete representations of speech without text annotations. Trained on data more varied than read speech alone, these models open up expressive speech synthesis that is not bottlenecked by what text can encode. However, most existing datasets lack the diversity, quality, and spontaneous expressiveness needed to explore this potential fully.
To address this limitation, the researchers created EXPRESSO, which comprises improvised dialogues and parallel expressive readings spanning 26 recognizable vocal styles. They also propose an expressive resynthesis benchmark built on this data: systems must encode speech into discrete units and resynthesize it while preserving content, speaker identity, and expressive style.
The EXPRESSO Dataset
EXPRESSO contains 47 hours of speech from 4 speakers of North American English. Of that, 37% consists of short prompts read in parallel across 7 core styles, plus emphasis recordings and long-form narrative material. The remaining 63% comprises improvised conversations that elicit 26 expressive styles through situational prompts.
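From these proportions, the approximate hour split falls out directly; a quick back-of-the-envelope check (rounded, using only the figures above):

```python
# Approximate hour split implied by the stated proportions.
TOTAL_HOURS = 47

read_hours = TOTAL_HOURS * 0.37        # readings, emphasis, long-form
improvised_hours = TOTAL_HOURS * 0.63  # improvised conversations

print(f"Expressive readings:  ~{read_hours:.1f} h")        # ~17.4 h
print(f"Improvised dialogues: ~{improvised_hours:.1f} h")  # ~29.6 h
```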
The styles span universal emotions, speech mannerisms, accents, and non-verbal sounds. The improvisations introduce a realism lacking in most acted speech. The data were professionally recorded in a studio environment, and the corpus balances content diversity with phonetic coverage.
To demonstrate the dataset's capabilities, the authors formulate an expressive resynthesis task on EXPRESSO: input audio with particular content and style must be encoded into discrete units and resynthesized in a target voice while preserving both the linguistic information and the expressivity.
The task bundles several challenges: learning and quantizing the units, decoding them back to audio, and conditioning the output on speaker and style. The discrete bottleneck forces the system to compress all necessary detail into the units. Metrics evaluate content, pitch, and style preservation using an automatic speech recognition (ASR) model, F0 analysis, and an expressive-style classifier, respectively.
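As a rough sketch of how these three axes might be scored, the snippet below computes a word error rate, an F0 contour correlation, and a style-classification accuracy. The library choices (jiwer, librosa, scipy) and F0 bounds are illustrative assumptions, not the authors' exact evaluation code:

```python
# Sketch of the three evaluation axes: content (WER), pitch (F0
# correlation), and style (classifier accuracy). Library choices
# and F0 bounds are illustrative, not the paper's exact tooling.
import numpy as np
import librosa
import jiwer
from scipy.stats import pearsonr

def content_error(reference_text: str, asr_hypothesis: str) -> float:
    """Word error rate between the reference transcript and the ASR
    transcript of the resynthesized audio (lower is better)."""
    return jiwer.wer(reference_text, asr_hypothesis)

def f0_correlation(source_wav: np.ndarray, resynth_wav: np.ndarray,
                   sr: int = 16_000) -> float:
    """Pearson correlation of F0 contours, on frames voiced in both."""
    f0_src, _, _ = librosa.pyin(source_wav, fmin=65, fmax=400, sr=sr)
    f0_out, _, _ = librosa.pyin(resynth_wav, fmin=65, fmax=400, sr=sr)
    n = min(len(f0_src), len(f0_out))
    voiced = ~np.isnan(f0_src[:n]) & ~np.isnan(f0_out[:n])
    corr, _ = pearsonr(f0_src[:n][voiced], f0_out[:n][voiced])
    return corr

def style_accuracy(predicted: list[str], target: list[str]) -> float:
    """Accuracy of a pretrained expressive-style classifier on the
    resynthesized audio (the classifier itself is omitted here)."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)
```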
Models Investigated
The study compares multiple speech encoders, including the self-supervised HuBERT and the neural audio codec EnCodec, trained on either read speech or more diverse data. Units are obtained either by k-means clustering of a self-supervised learning (SSL) model's hidden states or directly from EnCodec's discrete codes. A HiFi-GAN vocoder then generates the final speech, conditioned on speaker ID and, where applicable, style ID.
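To make the HuBERT-plus-k-means pipeline concrete, here is a minimal sketch using torchaudio's pretrained HUBERT_BASE bundle and scikit-learn. The feature layer, cluster count, and file name are illustrative assumptions; the paper's exact checkpoints and settings may differ:

```python
# Sketch: discretize speech into pseudo-phonetic units by k-means
# clustering of HuBERT hidden states. Layer choice, cluster count,
# and file name are illustrative assumptions.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def hubert_features(waveform: torch.Tensor, layer: int = 6) -> torch.Tensor:
    """Return frame-level hidden states (T, D) from one HuBERT layer."""
    with torch.no_grad():
        feats, _ = model.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)  # last requested layer, drop batch dim

waveform, sr = torchaudio.load("example.wav")  # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
feats = hubert_features(waveform)

# In practice k-means is fit on features pooled over many files, and
# real systems typically use larger codebooks (e.g. 500 clusters).
kmeans = MiniBatchKMeans(n_clusters=100, n_init="auto").fit(feats.numpy())

# Map each frame to its nearest centroid to obtain the discrete units.
units = kmeans.predict(feats.numpy())  # shape (T,), values in [0, 100)
```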
As a generic audio compression model, EnCodec provides high-bitrate baselines; HuBERT produces lower-bitrate phonetic units without a compression objective. To analyze the tradeoffs, the authors ablate conditioning strategies, model architectures, tokenizer training data, and other factors.
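The bitrate gap is easy to quantify: a discrete stream's rate is its frame rate times bits per code. The configurations below are illustrative assumptions (50 Hz HuBERT-style units with a 500-entry codebook versus an EnCodec-style setup with 8 parallel 1024-entry codebooks at 75 Hz), not figures from the paper:

```python
import math

def unit_bitrate(frame_rate_hz: float, vocab_size: int,
                 num_codebooks: int = 1) -> float:
    """Bits per second of a discrete unit stream."""
    return frame_rate_hz * num_codebooks * math.log2(vocab_size)

# HuBERT-style units: one 500-way code every 20 ms -> ~448 bps.
print(unit_bitrate(50, 500))

# EnCodec-style codes: 8 parallel 1024-way codebooks at 75 Hz -> 6000 bps.
print(unit_bitrate(75, 1024, num_codebooks=8))
```

A gap of an order of magnitude like this helps explain why codec units can retain prosodic detail that low-bitrate phonetic units tend to discard.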
EnCodec-based systems significantly outperformed HuBERT-based ones in pitch preservation while trailing slightly in content accuracy. HuBERT units learned from noisier data proved more robust than those trained on read speech alone, and clustering on EXPRESSO yielded better phonetic quality than clustering on the upstream SSL data.
Style modeling benefited greatly from explicitly conditioning the decoder, although some style information also leaked through the units themselves. Training the tokenizer on EXPRESSO rather than on the upstream SSL data brought further improvements, but out-of-domain generalization of style remains a challenge.
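A toy sketch of what explicit decoder conditioning can look like: learned speaker and style embeddings are broadcast along time and concatenated with unit embeddings before an upsampling stack. This stand-in replaces the HiFi-GAN generator with a two-layer convolutional decoder, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConditionedUnitDecoder(nn.Module):
    """Toy unit-to-waveform decoder conditioned on speaker and style IDs.
    A real system would use a HiFi-GAN generator in place of the
    truncated transposed-conv stack below."""

    def __init__(self, num_units=500, num_speakers=4, num_styles=26, dim=128):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, dim)
        self.spk_emb = nn.Embedding(num_speakers, dim)
        self.style_emb = nn.Embedding(num_styles, dim)
        # Upsample 50 Hz unit frames toward audio rate (truncated here).
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(3 * dim, dim, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.Conv1d(dim, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, units, speaker_id, style_id):
        # units: (B, T) discrete codes; speaker_id, style_id: (B,)
        x = self.unit_emb(units)                              # (B, T, D)
        cond = torch.cat([self.spk_emb(speaker_id),
                          self.style_emb(style_id)], dim=-1)  # (B, 2D)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)    # (B, T, 2D)
        x = torch.cat([x, cond], dim=-1).transpose(1, 2)      # (B, 3D, T)
        return self.upsample(x)                               # (B, 1, 8T)

decoder = ConditionedUnitDecoder()
wav = decoder(torch.randint(0, 500, (1, 100)),
              torch.tensor([0]), torch.tensor([3]))
```

Dropping the style embedding turns this into an unconditioned baseline, which is how ablations of the conditioning strategy can isolate its contribution.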
Discussion
This research provides critical infrastructure to spur progress in expressive speech synthesis. The EXPRESSO dataset captures diverse styles through a distinctive combination of natural improvisations and parallel expressive readings, and the resynthesis benchmark supports model development and analysis on top of this data.
Key findings include the advantage of compression-based over phonetic units for prosody modeling and the benefit of clustering on target-domain data. The results also highlight remaining challenges such as out-of-domain style transfer. The public release of the dataset, models, and metrics will support further work in this exciting field.
Building capable expressive speech systems can enhance naturalness and human-computer interaction across applications like virtual assistants, audiobooks, and accessibility tools. However, ethical considerations regarding data collection and appropriate system use will be critical as the technology evolves. EXPRESSO is an essential early step towards inclusive and empowering speech technologies.
Future Outlook
In summary, the authors use this data to formulate an expressive resynthesis benchmark that probes model capabilities. Experiments with self-supervised speech encoders reveal tradeoffs between compression-based and phonetic units, as well as the effects of decoder conditioning strategies.
While challenges remain in effectively transferring expressivity, especially across domains, EXPRESSO provides a strong impetus for progress. Its professional studio-quality recordings capture a spontaneity lacking in earlier acted datasets, and releasing the data, models, and metrics publicly equips the community to develop more nuanced speech systems. Overall, this work helps lay the groundwork for technologies that understand and generate the full range of human vocal expression.