Researchers from Meta AI recently introduced EXPRESSO, a sizeable, high-quality dataset of expressive speech, along with a benchmark for discrete textless speech resynthesis that preserves expressive style. Submitted to Interspeech, the work supports building systems that go beyond neutral read speech to capture diverse vocal expression: emotions, accents, emphasis, and non-verbal sounds.
Recent progress in self-supervised speech models has made it possible to learn discrete representations of speech without text annotations. Trained on data more varied than read speech alone, these models open up expressive speech synthesis that is not bottlenecked by what text can encode. However, most existing datasets lack the diversity, quality, and spontaneous expressiveness needed to explore this potential fully.
To address this limitation, the researchers created EXPRESSO, which comprises improvised dialogues and parallel expressive readings spanning 26 recognizable vocal styles. They also propose an expressive resynthesis benchmark built on this data: systems must encode speech into discrete units and resynthesize it while preserving content, speaker identity, and expressive style.
The EXPRESSO Dataset
EXPRESSO contains 47 hours of speech from 4 speakers of North American English. Of that, 37% consists of short prompts read in parallel across 7 core styles, plus emphasis recordings and long-form narrative material. The remaining 63% comprises improvised conversations that elicit 26 expressive styles through situational prompts.
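From these proportions, the approximate hour split falls out directly; a quick back-of-the-envelope check (rounded, using only the figures above):

```python
# Approximate hour split implied by the stated proportions.
TOTAL_HOURS = 47

read_hours = TOTAL_HOURS * 0.37        # readings, emphasis, long-form
improvised_hours = TOTAL_HOURS * 0.63  # improvised conversations

print(f"Expressive readings:  ~{read_hours:.1f} h")        # ~17.4 h
print(f"Improvised dialogues: ~{improvised_hours:.1f} h")  # ~29.6 h
```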
The styles span universal emotions, speech mannerisms, accents, and non-verbal sounds. The improvisations introduce a realism lacking in most acted speech. The data were professionally recorded in a studio environment, and the corpus balances content diversity with phonetic coverage.
To demonstrate the dataset's capabilities, the authors formulate an expressive resynthesis task on EXPRESSO: input audio with particular content and style must be encoded into discrete units and resynthesized in a target voice while preserving both the linguistic information and the expressivity.
The task bundles several challenges: learning and quantizing the units, decoding them back to audio, and conditioning the output on speaker and style. The discrete bottleneck forces the system to compress all necessary detail into the units. Metrics evaluate content, pitch, and style preservation using an automatic speech recognition (ASR) model, F0 analysis, and an expressive-style classifier, respectively.
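As a rough sketch of how these three axes might be scored, the snippet below computes a word error rate, an F0 contour correlation, and a style-classification accuracy. The library choices (jiwer, librosa, scipy) and F0 bounds are illustrative assumptions, not the authors' exact evaluation code:

```python
# Sketch of the three evaluation axes: content (WER), pitch (F0
# correlation), and style (classifier accuracy). Library choices
# and F0 bounds are illustrative, not the paper's exact tooling.
import numpy as np
import librosa
import jiwer
from scipy.stats import pearsonr

def content_error(reference_text: str, asr_hypothesis: str) -> float:
    """Word error rate between the reference transcript and the ASR
    transcript of the resynthesized audio (lower is better)."""
    return jiwer.wer(reference_text, asr_hypothesis)

def f0_correlation(source_wav: np.ndarray, resynth_wav: np.ndarray,
                   sr: int = 16_000) -> float:
    """Pearson correlation of F0 contours, on frames voiced in both."""
    f0_src, _, _ = librosa.pyin(source_wav, fmin=65, fmax=400, sr=sr)
    f0_out, _, _ = librosa.pyin(resynth_wav, fmin=65, fmax=400, sr=sr)
    n = min(len(f0_src), len(f0_out))
    voiced = ~np.isnan(f0_src[:n]) & ~np.isnan(f0_out[:n])
    corr, _ = pearsonr(f0_src[:n][voiced], f0_out[:n][voiced])
    return corr

def style_accuracy(predicted: list[str], target: list[str]) -> float:
    """Accuracy of a pretrained expressive-style classifier on the
    resynthesized audio (the classifier itself is omitted here)."""
    return sum(p == t for p, t in zip(predicted, target)) / len(target)
```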
Models Investigated
The study compares multiple speech encoders, including the self-supervised HuBERT and the neural audio codec EnCodec, trained on either read speech or more diverse data. Units are obtained either by k-means clustering of a self-supervised learning (SSL) model's hidden states or directly from EnCodec's discrete codes. A HiFi-GAN vocoder then generates the final speech, conditioned on speaker ID and, where applicable, style ID.
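To make the HuBERT-plus-k-means pipeline concrete, here is a minimal sketch using torchaudio's pretrained HUBERT_BASE bundle and scikit-learn. The feature layer, cluster count, and file name are illustrative assumptions; the paper's exact checkpoints and settings may differ:

```python
# Sketch: discretize speech into pseudo-phonetic units by k-means
# clustering of HuBERT hidden states. Layer choice, cluster count,
# and file name are illustrative assumptions.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans

bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def hubert_features(waveform: torch.Tensor, layer: int = 6) -> torch.Tensor:
    """Return frame-level hidden states (T, D) from one HuBERT layer."""
    with torch.no_grad():
        feats, _ = model.extract_features(waveform, num_layers=layer)
    return feats[-1].squeeze(0)  # last requested layer, drop batch dim

waveform, sr = torchaudio.load("example.wav")  # placeholder file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
feats = hubert_features(waveform)

# In practice k-means is fit on features pooled over many files, and
# real systems typically use larger codebooks (e.g. 500 clusters).
kmeans = MiniBatchKMeans(n_clusters=100, n_init="auto").fit(feats.numpy())

# Map each frame to its nearest centroid to obtain the discrete units.
units = kmeans.predict(feats.numpy())  # shape (T,), values in [0, 100)
```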
As a generic audio compression model, EnCodec provides high-bitrate baselines; HuBERT produces lower-bitrate phonetic units without a compression objective. To analyze the tradeoffs, the authors ablate conditioning strategies, model architectures, tokenizer training data, and other factors.
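The bitrate gap is easy to quantify: a discrete stream's rate is its frame rate times bits per code. The configurations below are illustrative assumptions (50 Hz HuBERT-style units with a 500-entry codebook versus an EnCodec-style setup with 8 parallel 1024-entry codebooks at 75 Hz), not figures from the paper:

```python
import math

def unit_bitrate(frame_rate_hz: float, vocab_size: int,
                 num_codebooks: int = 1) -> float:
    """Bits per second of a discrete unit stream."""
    return frame_rate_hz * num_codebooks * math.log2(vocab_size)

# HuBERT-style units: one 500-way code every 20 ms -> ~448 bps.
print(unit_bitrate(50, 500))

# EnCodec-style codes: 8 parallel 1024-way codebooks at 75 Hz -> 6000 bps.
print(unit_bitrate(75, 1024, num_codebooks=8))
```

A gap of an order of magnitude like this helps explain why codec units can retain prosodic detail that low-bitrate phonetic units tend to discard.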
EnCodec-based systems significantly outperformed HuBERT-based ones in pitch preservation while trailing slightly in content accuracy. HuBERT units learned from noisier data proved more robust than those trained on read speech alone, and clustering on EXPRESSO yielded better phonetic quality than clustering on the upstream SSL data.
Style modeling benefited greatly from explicitly conditioning the decoder, although some style information also leaked through the units themselves. Training the tokenizer on EXPRESSO rather than on the upstream SSL data brought further improvements, but out-of-domain generalization of style remains a challenge.
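A toy sketch of what explicit decoder conditioning can look like: learned speaker and style embeddings are broadcast along time and concatenated with unit embeddings before an upsampling stack. This stand-in replaces the HiFi-GAN generator with a two-layer convolutional decoder, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConditionedUnitDecoder(nn.Module):
    """Toy unit-to-waveform decoder conditioned on speaker and style IDs.
    A real system would use a HiFi-GAN generator in place of the
    truncated transposed-conv stack below."""

    def __init__(self, num_units=500, num_speakers=4, num_styles=26, dim=128):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, dim)
        self.spk_emb = nn.Embedding(num_speakers, dim)
        self.style_emb = nn.Embedding(num_styles, dim)
        # Upsample 50 Hz unit frames toward audio rate (truncated here).
        self.upsample = nn.Sequential(
            nn.ConvTranspose1d(3 * dim, dim, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.Conv1d(dim, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, units, speaker_id, style_id):
        # units: (B, T) discrete codes; speaker_id, style_id: (B,)
        x = self.unit_emb(units)                              # (B, T, D)
        cond = torch.cat([self.spk_emb(speaker_id),
                          self.style_emb(style_id)], dim=-1)  # (B, 2D)
        cond = cond.unsqueeze(1).expand(-1, x.size(1), -1)    # (B, T, 2D)
        x = torch.cat([x, cond], dim=-1).transpose(1, 2)      # (B, 3D, T)
        return self.upsample(x)                               # (B, 1, 8T)

decoder = ConditionedUnitDecoder()
wav = decoder(torch.randint(0, 500, (1, 100)),
              torch.tensor([0]), torch.tensor([3]))
```

Dropping the style embedding turns this into an unconditioned baseline, which is how ablations of the conditioning strategy can isolate its contribution.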
Discussion
This research provides critical infrastructure to spur progress in expressive speech synthesis. The EXPRESSO dataset captures diverse styles through a distinctive combination of natural improvisations and parallel expressive readings, and the resynthesis benchmark supports model development and analysis on top of this data.
Key findings include the advantage of compression-based over phonetic units for prosody modeling and the benefit of clustering on target-domain data. The results also highlight remaining challenges such as out-of-domain style transfer. The public release of the dataset, models, and metrics will support further work in this exciting field.
Building capable expressive speech systems can enhance naturalness and human-computer interaction across applications like virtual assistants, audiobooks, and accessibility tools. However, ethical considerations regarding data collection and appropriate system use will be critical as the technology evolves. EXPRESSO is an essential early step towards inclusive and empowering speech technologies.
Future Outlook
In summary, the authors use this data to formulate an expressive resynthesis benchmark that probes model capabilities. Experiments with self-supervised speech encoders reveal tradeoffs between compression-based and phonetic units, as well as the effects of decoder conditioning strategies.
While challenges remain in effectively transferring expressivity, especially across domains, EXPRESSO provides a strong impetus for progress. Its professional studio-quality recordings capture a spontaneity lacking in earlier acted datasets, and releasing the data, models, and metrics publicly equips the community to develop more nuanced speech systems. Overall, this work helps lay the groundwork for technologies that understand and generate the full range of human vocal expression.