In an article recently posted to the Meta Research website, researchers from Meta’s Fundamental AI Research (FAIR) team discussed a novel model called Audiobox that can generate various audio modalities, such as speech, sound, and music, from natural language prompts.
They demonstrated that the proposed model can generate novel styles and control multiple aspects of the audio, such as the transcript, vocal style, and acoustic features. Furthermore, the study proposed a new evaluation metric, Joint-CLAP, that correlates strongly with human judgment.
Background
Audio is an important part of everyday life, but creating it requires expertise and is time-consuming. Over the last decade, significant progress has been made in large-scale generative models for a single modality, such as speech, sound, or music, by adopting more powerful and robust models and scaling data.
However, these generative models still lack controllability in several respects: sound generation models offer only coarse control (a description such as "a person speaking" yields generic, mumbling human voices), while speech generation models cannot synthesize new styles from a text description and provide limited coverage of acoustic conditions such as outdoor environments.
Moreover, existing audio generative models are mostly modality-specific and can only generate either speech, music, or sound effects. In contrast, real-world audio content often contains a mix of speech, music, and sound effects. Therefore, developing audio generative models that are controllable, generalizable, and high-quality can bring transformative changes to the audio-generation process.
About the Research
This study presents Audiobox, a unified flow-matching-based model that can generate various audio modalities. Flow-matching is a generative modeling method derived from the continuous normalizing flow framework: it models the paths that continuously transform samples from a simple prior distribution into samples from the complex data distribution. The study builds on Voicebox and SpeechFlow, flow-matching-based models for transcript-guided speech generation and self-supervised speech pre-training, respectively.
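To make the flow-matching idea concrete, the following is a minimal sketch of a conditional flow-matching training step under common simplifying assumptions (a linear interpolation path between a prior sample and a data sample); `vector_field_net` and the feature shapes are illustrative placeholders, not details from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field_net, x1):
    """One conditional flow-matching training step (illustrative).

    x1: a batch of target audio features, shape (B, T, D).
    vector_field_net: a network predicting the velocity v(x_t, t).
    """
    x0 = torch.randn_like(x1)                             # sample from the simple prior
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # random time in [0, 1]

    # Linear path between the prior sample and the data sample
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0                              # constant velocity along that path

    pred_velocity = vector_field_net(x_t, t.view(-1))
    return F.mse_loss(pred_velocity, target_velocity)
```

At generation time, the learned vector field is integrated from the prior toward the data distribution, which is where the ODE solver discussed later comes in.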
Description- and example-based prompting were designed to increase controllability and to unify the speech and sound generation processes, allowing the vocal style, transcript, and other audio attributes to be controlled independently during speech generation. To improve generalization with a limited number of labels, the study adopted a self-supervised infilling objective to train on large amounts of unlabeled audio data.
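The infilling objective can be pictured as masking spans of unlabeled audio and asking the model to regenerate them from the surrounding context. The sketch below illustrates this setup; the masking ratio, span length, and feature dimensions are illustrative assumptions, not values from the paper.

```python
import torch

def make_infilling_mask(batch, seq_len, mask_ratio=0.3, span=10):
    """Build a boolean mask marking the audio frames to infill (True = masked)."""
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    n_spans = max(1, int(seq_len * mask_ratio / span))
    for b in range(batch):
        starts = torch.randint(0, seq_len - span, (n_spans,))
        for s in starts:
            mask[b, s:s + span] = True
    return mask

# Masked frames are hidden from the model (here simply zeroed out); the
# generative objective is then computed only on the masked frames, so the
# model learns to infill audio from its unmasked context without any labels.
features = torch.randn(4, 200, 80)            # (B, T, D) unlabeled audio features
mask = make_infilling_mask(4, 200)
context = features.masked_fill(mask.unsqueeze(-1), 0.0)
```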
Methodologies Used
The authors first pre-trained a unified model, AUDIOBOX SSL, on large quantities of unlabeled speech, music, and sound effects, then fine-tuned it for transcript-guided speech generation (AUDIOBOX SPEECH) and description-guided sound generation (AUDIOBOX SOUND), showing significant improvements over prior studies. Finally, they presented AUDIOBOX, a model for both sound and speech generation that bridges the gap between speech and sound creation by enabling natural language prompts for holistic style control, with further disentangled speech control through voice prompts.
They also proposed Joint-CLAP, a joint audio-text embedding network trained on both sound and speech description datasets, to facilitate the evaluation of Audiobox on tasks such as in-context TTS, text-to-sound, text-to-music, and style transfer, and to advance research in text-guided universal audio generative models. Moreover, they introduced Bespoke Solver, a novel post-training inference optimization method for flow-matching models that improves the performance-efficiency trade-off.
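A CLAP-style joint embedding evaluates generated audio by encoding audio and text into a shared space and measuring their similarity. The sketch below shows this scoring step; the encoder interfaces are hypothetical placeholders rather than the Joint-CLAP architecture itself.

```python
import torch
import torch.nn.functional as F

def text_audio_similarity(audio_encoder, text_encoder, audio, captions):
    """Score audio clips against text descriptions in a shared embedding space.

    Returns an (N_audio, N_text) matrix of cosine similarities; averaging the
    diagonal for matched pairs gives a text-to-audio faithfulness score.
    """
    a = F.normalize(audio_encoder(audio), dim=-1)     # (N, D) audio embeddings
    t = F.normalize(text_encoder(captions), dim=-1)   # (M, D) text embeddings
    return a @ t.T

# During training, the same similarity matrix (scaled by a learned temperature)
# is fed to a symmetric cross-entropy loss so that matching audio-text pairs
# score higher than mismatched ones.
```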
Research Findings
The results show that Audiobox sets new benchmarks on speech and sound generation tasks and unlocks new ways of creating audio with novel vocal and acoustic styles. The authors compared their model with several baselines, including Voicebox, VALL-E, NaturalSpeech 2, UniAudio, and VoiceLDM, and showed that AUDIOBOX SPEECH achieves a new best style similarity score (0.745 vs. 0.710 from UniAudio) on the audiobook-domain test set (LibriSpeech) and drastically improves on Voicebox in all other domains, with similarity gains ranging from 0.096 to 0.156. AUDIOBOX SOUND outperforms existing sound generation models on the AudioCaps dataset, achieving an FAD score of 0.77.
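The FAD (Fréchet Audio Distance) metric cited above compares the distribution of embeddings of generated audio with that of reference audio; lower is better. Below is a minimal sketch of the standard computation, assuming the embeddings have already been extracted with a pretrained audio classifier.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real, emb_gen):
    """Fréchet distance between Gaussians fitted to two sets of embeddings.

    emb_real, emb_gen: (N, D) arrays of audio embeddings from a pretrained
    classifier; lower values mean the generated audio's statistics are
    closer to the reference set.
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # drop tiny numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```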
AUDIOBOX outperforms existing domain-specific models on multiple tasks and comes close to AUDIOBOX SOUND and AUDIOBOX SPEECH on their respective benchmark tasks. Joint-CLAP significantly outperforms existing CLAP models at retrieving speech from descriptions, and its text-to-audio similarity correlates more strongly with human judgment. Bespoke Solver speeds up audio generation by 25 times compared with the default ODE solver without losing performance.
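Flow-matching models generate audio by numerically integrating the learned vector field with an ODE solver, which is why the solver governs the speed-quality trade-off that Bespoke Solver targets. The sketch below shows plain Euler integration of this sampling step as a baseline; it is not the Bespoke Solver method, and `vector_field_net`, the step count, and the shapes are illustrative.

```python
import torch

@torch.no_grad()
def sample_flow(vector_field_net, shape, steps=32, device="cpu"):
    """Generate audio features by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = torch.randn(shape, device=device)              # start from the simple prior
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * vector_field_net(x, t)            # one Euler step along the field
    return x

# Fewer steps mean fewer network evaluations (faster generation) but cruder
# integration; methods such as Bespoke Solver aim to keep quality at low step
# counts by tailoring the solver to the trained model.
```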
The research has potential applications in audio creation scenarios such as podcasts, movies, ads, and audiobooks. It can help audio creators generate content that matches their desired style and subject matter, and it enables audio consumers to customize their listening experience with different vocal and acoustic styles.
Conclusion
In summary, Audiobox is a unified model for sound and speech generation. It leverages self-supervised pre-training, description- and example-based prompting, and disentangled speech control to achieve unprecedented controllability and versatility in universal audio generation. In addition, Bespoke Solver improves inference efficiency, Joint-CLAP enables more reliable evaluation, and a watermarking system supports responsible use of the generated audio.
The authors acknowledge that their model still has room for improvement in terms of audio quality, naturalness, and diversity, especially for challenging domains such as music and conversational speech. They believe that incorporating more modalities, such as images or videos, as conditioning input or output could further enhance the expressiveness and applicability of their model.