In an article recently posted to the Meta Research website, researchers from Meta’s Fundamental AI Research (FAIR) team discussed a novel model called Audiobox that can generate various audio modalities, such as speech, sound, and music, from natural language prompts.
They demonstrated that the proposed model can generate novel styles and control multiple aspects of the audio, such as the transcript, vocal style, and acoustic features. Furthermore, the study proposed a new evaluation metric, Joint-CLAP, that correlates strongly with human judgment.
Background
Audio is an important part of everyday life, but creating it requires expertise and is time-consuming. Over the last decade, significant progress has been made in large-scale generative models for a single modality, such as speech, sound, or music, by adopting more powerful and robust models and scaling data.
However, these generative models still lack controllability in several respects: sound generation models offer only coarse control (a description such as "a person speaking" yields generic, mumbling human voices), while speech generation models cannot synthesize new styles from a text description and provide limited coverage of acoustic conditions such as outdoor environments.
Moreover, existing audio generative models are mostly modality-specific and can only generate either speech, music, or sound effects. In contrast, real-world audio content often contains a mix of speech, music, and sound effects. Therefore, developing audio generative models that are controllable, generalizable, and high-quality can bring transformative changes to the audio-generation process.
About the Research
This study presents Audiobox, a unified flow-matching-based model that can generate various audio modalities. Flow-matching is a generative modeling method derived from the continuous normalizing flow framework: it models the paths that continuously transform samples from a simple prior distribution into samples from the complex data distribution. The study builds on Voicebox and SpeechFlow, flow-matching-based models for transcript-guided speech generation and self-supervised speech pre-training, respectively.
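To make the flow-matching idea concrete, the following is a minimal sketch of a conditional flow-matching training step under common simplifying assumptions (a linear interpolation path between a prior sample and a data sample); `vector_field_net` and the feature shapes are illustrative placeholders, not details from the paper.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(vector_field_net, x1):
    """One conditional flow-matching training step (illustrative).

    x1: a batch of target audio features, shape (B, T, D).
    vector_field_net: a network predicting the velocity v(x_t, t).
    """
    x0 = torch.randn_like(x1)                             # sample from the simple prior
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # random time in [0, 1]

    # Linear path between the prior sample and the data sample
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0                              # constant velocity along that path

    pred_velocity = vector_field_net(x_t, t.view(-1))
    return F.mse_loss(pred_velocity, target_velocity)
```

At generation time, the learned vector field is integrated from the prior toward the data distribution, which is where the ODE solver discussed later comes in.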
Description- and example-based prompting were designed to increase controllability and to unify the speech and sound generation processes, allowing the vocal style, transcript, and other audio attributes to be controlled independently during speech generation. To improve generalization with a limited number of labels, the study adopted a self-supervised infilling objective to train on large amounts of unlabeled audio data.
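The infilling objective can be pictured as masking spans of unlabeled audio and asking the model to regenerate them from the surrounding context. The sketch below illustrates this setup; the masking ratio, span length, and feature dimensions are illustrative assumptions, not values from the paper.

```python
import torch

def make_infilling_mask(batch, seq_len, mask_ratio=0.3, span=10):
    """Build a boolean mask marking the audio frames to infill (True = masked)."""
    mask = torch.zeros(batch, seq_len, dtype=torch.bool)
    n_spans = max(1, int(seq_len * mask_ratio / span))
    for b in range(batch):
        starts = torch.randint(0, seq_len - span, (n_spans,))
        for s in starts:
            mask[b, s:s + span] = True
    return mask

# Masked frames are hidden from the model (here simply zeroed out); the
# generative objective is then computed only on the masked frames, so the
# model learns to infill audio from its unmasked context without any labels.
features = torch.randn(4, 200, 80)            # (B, T, D) unlabeled audio features
mask = make_infilling_mask(4, 200)
context = features.masked_fill(mask.unsqueeze(-1), 0.0)
```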
Methodologies Used
The authors first pre-trained a unified model, AUDIOBOX SSL, on large quantities of unlabeled speech, music, and sound effects, then fine-tuned it for transcript-guided speech generation (AUDIOBOX SPEECH) and description-guided sound generation (AUDIOBOX SOUND), showing significant improvements over prior studies. Finally, they presented AUDIOBOX, a model for both sound and speech generation that bridges the gap between speech and sound creation by enabling natural language prompts for holistic style control, with further disentangled speech control through voice prompts.
They also proposed Joint-CLAP, a joint audio-text embedding network trained on both sound and speech description datasets, to facilitate the evaluation of Audiobox on tasks such as in-context TTS, text-to-sound, text-to-music, and style transfer, and to advance research in text-guided universal audio generative models. Moreover, they introduced Bespoke Solver, a novel post-training inference optimization method for flow-matching models that improves the performance-efficiency trade-off.
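A CLAP-style joint embedding evaluates generated audio by encoding audio and text into a shared space and measuring their similarity. The sketch below shows this scoring step; the encoder interfaces are hypothetical placeholders rather than the Joint-CLAP architecture itself.

```python
import torch
import torch.nn.functional as F

def text_audio_similarity(audio_encoder, text_encoder, audio, captions):
    """Score audio clips against text descriptions in a shared embedding space.

    Returns an (N_audio, N_text) matrix of cosine similarities; averaging the
    diagonal for matched pairs gives a text-to-audio faithfulness score.
    """
    a = F.normalize(audio_encoder(audio), dim=-1)     # (N, D) audio embeddings
    t = F.normalize(text_encoder(captions), dim=-1)   # (M, D) text embeddings
    return a @ t.T

# During training, the same similarity matrix (scaled by a learned temperature)
# is fed to a symmetric cross-entropy loss so that matching audio-text pairs
# score higher than mismatched ones.
```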
Research Findings
The results show that Audiobox sets new benchmarks on speech and sound generation tasks and unlocks new ways of creating audio with novel vocal and acoustic styles. The authors compared their model with several baselines, including Voicebox, VALL-E, NaturalSpeech 2, UniAudio, and VoiceLDM, and showed that AUDIOBOX SPEECH achieves a new best style similarity score (0.745 vs. 0.710 from UniAudio) on the audiobook-domain test set (LibriSpeech) and drastically improves on Voicebox in all other domains, with similarity gains ranging from 0.096 to 0.156. AUDIOBOX SOUND outperforms existing sound generation models on the AudioCaps dataset, achieving an FAD score of 0.77.
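The FAD (Fréchet Audio Distance) metric cited above compares the distribution of embeddings of generated audio with that of reference audio; lower is better. Below is a minimal sketch of the standard computation, assuming the embeddings have already been extracted with a pretrained audio classifier.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_real, emb_gen):
    """Fréchet distance between Gaussians fitted to two sets of embeddings.

    emb_real, emb_gen: (N, D) arrays of audio embeddings from a pretrained
    classifier; lower values mean the generated audio's statistics are
    closer to the reference set.
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # drop tiny numerical imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```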
AUDIOBOX outperforms existing domain-specific models on multiple tasks and comes close to AUDIOBOX SOUND and AUDIOBOX SPEECH on their respective benchmark tasks. Joint-CLAP significantly outperforms existing CLAP models at retrieving speech from descriptions, and its text-to-audio similarity correlates more strongly with human judgment. Bespoke Solver speeds up audio generation by 25 times compared with the default ODE solver without losing performance.
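Flow-matching models generate audio by numerically integrating the learned vector field with an ODE solver, which is why the solver governs the speed-quality trade-off that Bespoke Solver targets. The sketch below shows plain Euler integration of this sampling step as a baseline; it is not the Bespoke Solver method, and `vector_field_net`, the step count, and the shapes are illustrative.

```python
import torch

@torch.no_grad()
def sample_flow(vector_field_net, shape, steps=32, device="cpu"):
    """Generate audio features by integrating dx/dt = v(x, t) from t=0 to t=1."""
    x = torch.randn(shape, device=device)              # start from the simple prior
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + dt * vector_field_net(x, t)            # one Euler step along the field
    return x

# Fewer steps mean fewer network evaluations (faster generation) but cruder
# integration; methods such as Bespoke Solver aim to keep quality at low step
# counts by tailoring the solver to the trained model.
```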
The research has potential applications in audio creation scenarios such as podcasts, movies, ads, and audiobooks. It can help audio creators generate content that matches their desired style and subject matter, and it enables audio consumers to customize their listening experience with different vocal and acoustic styles.
Conclusion
In summary, Audiobox is a unified model for sound and speech generation. It leverages self-supervised pre-training, description- and example-based prompting, and disentangled speech control to achieve unprecedented controllability and versatility in universal audio generation. In addition, Bespoke Solver improves inference efficiency, Joint-CLAP enables more reliable evaluation, and a watermarking system supports responsible use of the generated audio.
The authors acknowledge that their model still has room for improvement in terms of audio quality, naturalness, and diversity, especially for challenging domains such as music and conversational speech. They believe that incorporating more modalities, such as images or videos, as conditioning input or output could further enhance the expressiveness and applicability of their model.