Automatic lyrics transcription (ALT) is the task of converting an audio recording of a vocal performance into its written lyrics. ALT is an important task in music information retrieval (MIR) and analysis, and its primary objective is to identify and transcribe lyrics from vocal performances accurately.
In a recent study posted to the arXiv* preprint server, the authors demonstrated that a supervised automatic speech recognition (ASR) model can be employed effectively for zero-shot multilingual ALT in a system called LyricWhiz, yielding noteworthy results. This finding provides further incentive to apply the approach to large open datasets from various MIR tasks in order to create the first multilingual ALT dataset, an effort aimed at unlocking the capabilities of multilingual ALT models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Use of ChatGPT in lyric transcription
ChatGPT, a large language model accessed through chat-based interaction, has proven broadly useful for streamlining workflows in diverse fields, including multimodal intelligence. The recently released AutoGPT is regarded by some as a nascent manifestation of artificial general intelligence. Motivated by these advances, LyricWhiz combines Whisper and ChatGPT to make the ALT workflow more effective.
LyricWhiz employs prompt augmentation, asking ChatGPT to analyze the given prompt together with candidate lyrics and identify the most accurate prediction among several Whisper trials. In the proposed methodology, Whisper acts as the auditory system, converting audio into written text, while GPT-4 serves as the cognitive component: an expert annotator that selects and corrects the most contextually appropriate outputs.
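The ensemble-then-select structure described above can be sketched in a few lines of Python. Note that this is a minimal illustration, not the paper's implementation: the function names `run_whisper_trial` and `select_best` are hypothetical, the Whisper call is stubbed with canned outputs, and a simple majority vote stands in for the GPT-4 selection prompt.

```python
from collections import Counter

def run_whisper_trial(audio_path: str, seed: int) -> str:
    # Placeholder for a real Whisper inference call. Sampling-based
    # decoding means repeated runs can yield different transcriptions.
    candidates = [
        "hello darkness my old friend",
        "hello darkness my old friend",
        "hello darkness my cold friend",
    ]
    return candidates[seed % len(candidates)]

def select_best(candidates: list[str]) -> str:
    # Stand-in for the GPT-4 selection step: here we simply return the
    # transcription produced most often across trials (majority vote),
    # whereas LyricWhiz prompts an LLM to judge the candidates.
    return Counter(candidates).most_common(1)[0][0]

def transcribe(audio_path: str, n_trials: int = 3) -> str:
    # Run several transcription trials, then pick one final answer.
    trials = [run_whisper_trial(audio_path, s) for s in range(n_trials)]
    return select_best(trials)

print(transcribe("song.mp3"))  # "hello darkness my old friend"
```

The key design idea is the separation of concerns: a noisy transcriber generates candidates, and a separate selector resolves disagreements between them.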
Background
ALT technology has a wide range of applications within the music industry, including enhancing cataloging processes, improving music search, and powering music recommendations.
In addition, ALT can support a range of research endeavors in music, such as music genre classification, lyrics generation for music composition, and sentiment analysis. Accurate and efficient ALT is therefore crucial for advancing MIR and for creating novel applications in the field of music.
Nevertheless, an adequately robust and precise ALT system has yet to be developed. One primary factor is the inherent difficulty of transcribing lyrics. Diverse singing styles and skill levels produce varied timbres for the same pronunciation. Furthermore, sung phonemes can deviate significantly from spoken ones, through extended duration, pitch alterations, and even vowel substitutions made to fit the melody. Finally, the diverse musical accompaniments found across genres make it difficult to separate the vocal signal from the surrounding audio. Overcoming these challenges requires a more resilient ALT system that can surpass current models across various scenarios, such as transcribing lyrics in multiple languages.
The present paper introduces an innovative approach to automated lyrics transcription. LyricWhiz outperforms current methods on diverse ALT datasets, achieving a notable decrease in word error rate (WER) for English lyrics, and it also produces accurate transcriptions across multiple languages. The system is robust, supports numerous languages, and requires no training.
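WER, the metric cited here and in the results below, is the word-level edit distance (insertions, deletions, and substitutions) between a reference transcript and a hypothesis, normalized by the reference length. A minimal illustration of how it is computed (not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the quick brown fox"))          # 0.0
print(wer("hello darkness my old friend", "hello darkness my friend"))  # 0.2
```

A WER of 0.2 means one word in five was transcribed incorrectly, so lower values indicate better transcriptions.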
Significance of the results
The present study offers the following contributions:
- The proposed method demonstrates remarkable performance in reducing WER across different ALT benchmark datasets such as MUSDB18, Hansen, and Jamendo. Furthermore, this approach achieves comparable results to the existing literature for the in-domain task of singing voice transcription (DSing).
- The work presents the first ALT system capable of long-form, multilingual, zero-shot ALT. This is achieved by integrating a robust speech transcription model with a large language model.
- This study has led to the development of a comprehensive dataset of lyrics transcriptions that is publicly accessible, multilingual, and of significant scale. This dataset includes a copyright statement that removes the need for additional user review and enables unrestricted public usage.
- It offers a human-annotated subset for estimating noise levels and assessing the performance of multilingual automatic lyrics transcription (ALT).
Conclusion
LyricWhiz, the system presented in this paper, is an innovative automatic lyrics transcription system that performs exceptionally well across multiple datasets and music genres. By integrating the Whisper ASR system with the GPT-4 language model, the method achieves a substantial decrease in WER for English and transcribes effectively across multiple languages. The authors have also used LyricWhiz to create a comprehensive lyrics dataset, the first of its kind to be publicly available at large scale and across multiple languages.