In an article recently submitted to the arXiv* preprint server, researchers introduced a model called Joint Audio and Symbolic Conditioning (JASCO) for generating high-quality music samples from text descriptions. Their objective was to combine symbolic and audio-based conditions within a single model, advancing temporally controlled conditional music generation.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Music is a universal form of expression capable of conveying complex meanings and emotions. However, creating music is not an easy task, especially for non-experts who lack musical knowledge and skills. Therefore, there is a growing interest in developing artificial intelligence (AI) systems that can assist or automate the music creation process.
One such application is text-to-music generation, which aims to produce music based on natural language input. This capability allows users to generate personalized music aligned with their preferences and emotions. However, this is a challenging task because it requires capturing the semantic and emotional aspects of natural language and translating them into musical elements.
About the Research
In this paper, the authors proposed JASCO, a novel model that generates music from a global text description together with local symbolic or audio conditions. For example, users can provide JASCO with text input such as "a happy rock song with electric guitar and drums," along with a chord progression or a drum track, and JASCO produces a music sample that satisfies these specifications. The authors used a flow-matching technique to integrate the symbolic and audio conditions within the same model. Flow matching is a way of training continuous normalizing flows, a class of generative models capable of learning complex probability distributions and generating realistic samples.
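To give a rough sense of how flow matching works in general (this is a minimal illustrative sketch, not the authors' implementation), a model learns a velocity field that transports random noise toward data along a simple interpolation path. In the PyTorch sketch below, the network architecture, latent dimension, and conditioning vector are all hypothetical stand-ins; in JASCO the conditioning would correspond to text, chord, and drum signals.

```python
# Minimal conditional flow-matching training sketch (illustrative only;
# model architecture, latent shapes, and conditioning are hypothetical).
import torch
import torch.nn as nn

dim, cond_dim = 128, 64  # hypothetical latent and conditioning sizes

# Toy velocity-field network: predicts dx/dt given (x_t, t, condition).
model = nn.Sequential(nn.Linear(dim + 1 + cond_dim, 256), nn.SiLU(),
                      nn.Linear(256, dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

def flow_matching_step(x1, cond):
    """One training step: regress the straight-line velocity (x1 - x0)."""
    x0 = torch.randn_like(x1)            # noise sample
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # linear interpolation path
    target = x1 - x0                     # constant target velocity
    pred = model(torch.cat([xt, t, cond], dim=-1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Usage with random stand-in data:
x1 = torch.randn(32, dim)         # stand-in for music latents
cond = torch.randn(32, cond_dim)  # stand-in for text/chord/drum conditioning
print(flow_matching_step(x1, cond))
```

At generation time, a sample is produced by integrating the learned velocity field from noise at t=0 to data at t=1, conditioned on the user's text and symbolic inputs.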
The study compared JASCO against two established baselines: multi-track sequential generative adversarial networks (MuseGAN) and MelNet, a generative model for audio in the frequency domain. MuseGAN is a GAN that generates multi-track music from symbolic conditions, while MelNet is a recurrent neural network (RNN) that generates high-fidelity audio from audio conditions.
The evaluation of JASCO and the baselines focused on three key criteria: quality, diversity, and adherence. Quality assesses the realism and pleasantness of the generated samples, diversity measures the variability among generated samples, and adherence evaluates how well the generated samples align with the specified conditions.
Research Findings
The authors conducted quantitative and qualitative experiments to evaluate their technique. In the quantitative experiment, objective metrics were used to assess the quality, diversity, and adherence of the generated music samples. Specifically, the Fréchet Audio Distance (FAD) measured quality, pairwise cosine similarity (PCS) measured diversity, and Mel Spectrogram Distance (MSD) measured adherence. The outcomes demonstrated that JASCO outperformed MuseGAN and MelNet across all three metrics, indicating its capability to produce more realistic, diverse, and faithful samples than the baseline models.
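As a hedged illustration of how two such metrics are commonly computed from embedding statistics (the specific embedding extractor and implementation details are not given in this summary and are assumed here), the sketch below computes a Fréchet distance between Gaussian fits of real and generated audio embeddings, and a mean pairwise cosine similarity over generated embeddings.

```python
# Illustrative metric computations over audio embeddings (the embedding
# extractor itself, e.g. a pretrained audio classifier, is assumed).
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real, gen):
    """Frechet distance between Gaussians fit to two embedding sets."""
    mu_r, mu_g = real.mean(0), gen.mean(0)
    cov_r = np.cov(real, rowvar=False)
    cov_g = np.cov(gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can introduce tiny
        covmean = covmean.real     # imaginary parts; drop them
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2 * covmean)

def mean_pairwise_cosine(emb):
    """Mean pairwise cosine similarity; lower suggests more diversity."""
    x = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = x @ x.T
    n = len(x)
    return (sim.sum() - n) / (n * (n - 1))  # exclude self-similarity

# Usage with random stand-in embeddings:
real = np.random.randn(200, 128)
gen = np.random.randn(200, 128)
print(frechet_distance(real, gen), mean_pairwise_cosine(gen))
```

In practice, a lower Fréchet distance indicates that generated samples are statistically closer to real music, while lower pairwise similarity among generated samples indicates greater diversity.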
In the qualitative experiment, human evaluations were employed to assess the quality, diversity, and adherence of the generated samples. The study used Amazon Mechanical Turk (AMT) to gather ratings and feedback from 100 listeners. The results revealed that JASCO received higher ratings and more positive feedback than MuseGAN and MelNet on all evaluated criteria, confirming that JASCO generated music samples that were more appealing, more varied, and more faithful to the specified conditions than those produced by the baseline models.
Additionally, the researchers provided a demo page where readers can listen to generated samples. These examples further illustrate JASCO's ability to generate high-quality music from text and audio conditions compared with the outputs of MuseGAN and MelNet.
Applications
The developed technique allows the creation of personalized music tailored to preferences, moods, and contexts. Users can generate various types of music, such as relaxing ambient music for meditation, cheerful pop music for parties, or dramatic orchestral music for movie scenes. By inputting text descriptions, users can explore different music genres, styles, and elements; for example, they can generate jazz with saxophone and piano, classical music with violin and cello, or rock with electric guitar and drums.
Additionally, JASCO can assist users in managing stress, anxiety, depression, or other mental health issues. Users can create soothing music accompanied by nature sounds, motivational music featuring positive lyrics, or cathartic music with expressive melodies. This versatility allows for a wide range of creative and therapeutic applications, enhancing the personalization and emotional impact of music creation.
Conclusion
In summary, JASCO demonstrated effectiveness in automatically generating music directly from text inputs. It could revolutionize various domains, such as personalized music creation, music education and entertainment, and music therapy and wellness. Moving forward, the researchers suggested enhancing the semantic and emotional alignment between text inputs and music outputs, exploring more diverse and complex text inputs and conditions, and integrating other modalities such as images, videos, or speech.
Journal reference:
- Preliminary scientific report.
Tal, O., et al. Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation. arXiv, 2024, 2406.10970v1. DOI: 10.48550/arXiv.2406.10970, https://arxiv.org/pdf/2406.10970