LyricWhiz: Unleashing Multilingual ALT with Whisper and ChatGPT

Automatic lyrics transcription (ALT) converts an audio recording of a vocal performance into written lyrics. It is an important task in music information retrieval (MIR) and analysis, and its primary objective is to identify and transcribe sung lyrics accurately.

In a recent study posted to the arXiv* preprint server, the authors demonstrated that Whisper, a weakly supervised automatic speech recognition (ASR) model, can be effectively employed for zero-shot multilingual ALT, yielding noteworthy results. This finding gave them additional incentive to use extensive open datasets from various MIR tasks to create the first multilingual ALT dataset, with the aim of unlocking the capabilities of multilingual ALT models.

Study: LyricWhiz: Unleashing Multilingual ALT with Whisper and ChatGPT. Image Credit: SomYuZu / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

Use of ChatGPT in lyric transcription

ChatGPT, a chat-based large language model, has proved useful for streamlining workflows in diverse fields, including multimodal intelligence. The more recent AutoGPT has been described as an early step toward artificial general intelligence. Motivated by these advances, LyricWhiz combines Whisper and ChatGPT to improve the ALT workflow.

LyricWhiz uses prompting to ask ChatGPT to analyze the candidate lyrics it is given and to identify the most accurate prediction from a series of Whisper trials. In the proposed method, Whisper acts as the auditory system, converting the audio into written text, while GPT-4 acts as the cognitive component, serving as a highly proficient annotator that selects and corrects contextually appropriate outputs.
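The selection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `transcribe` and `ask_llm` are hypothetical callables standing in for a Whisper call and a ChatGPT call, the prompt wording is invented for this example, and the reply is assumed to contain the number of the chosen candidate. The sketch covers only selection, not the correction step the article also mentions.

```python
import re
from typing import Callable, List


def build_selection_prompt(candidates: List[str]) -> str:
    """Format numbered candidate transcriptions into one prompt.

    The wording is illustrative only; it is not the prompt used in the paper.
    """
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(candidates))
    return (
        "Below are several candidate transcriptions of the same song.\n"
        "Reply with only the number of the most accurate one.\n\n"
        f"{numbered}"
    )


def parse_choice(reply: str, n_candidates: int) -> int:
    """Return the zero-based index named in the reply, or 0 if it is unusable."""
    match = re.search(r"\d+", reply)
    if match:
        index = int(match.group()) - 1
        if 0 <= index < n_candidates:
            return index
    return 0  # assumption: fall back to the first trial


def select_transcription(
    audio_path: str,
    transcribe: Callable[[str], str],  # hypothetical stand-in for a Whisper call
    ask_llm: Callable[[str], str],     # hypothetical stand-in for a ChatGPT call
    n_trials: int = 3,
) -> str:
    """Run several transcription trials and let the language model pick one."""
    candidates = [transcribe(audio_path) for _ in range(n_trials)]
    reply = ask_llm(build_selection_prompt(candidates))
    return candidates[parse_choice(reply, len(candidates))]
```

Passing the two models in as callables keeps the sketch independent of any particular client library, so it can be exercised with simple stub functions before real models are wired in.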

Background

ALT technology has a wide range of applications within the music industry, including cataloging, music search, and music recommendation.

In addition, ALT has the potential to support a range of music research, such as music genre classification, lyrics generation for music composition, sentiment analysis, and security assessment. Precise and effective ALT is therefore crucial for the advancement of MIR and the creation of novel applications in the field of music.

Nevertheless, a sufficiently robust and accurate ALT system has yet to be developed. One primary reason is the inherent difficulty of transcribing lyrics. Diverse singing styles and skill levels produce different timbres for the same pronunciation. Furthermore, sung phonemes can vary significantly in pronunciation, with extended durations, pitch changes, and even vowel substitutions, all of which are used to fit the melody. Finally, musical accompaniment spanning various genres makes it difficult to separate the vocal signal from the surrounding audio. To overcome these challenges, a more resilient ALT system is needed that can surpass current models across various scenarios, such as the transcription of lyrics in multiple languages.

The paper introduces a new approach to automatic lyrics transcription. LyricWhiz outperforms current methods on diverse ALT datasets, leading to a notable decrease in word error rate (WER) for English lyrics. Moreover, LyricWhiz produces accurate transcriptions across multiple languages. The system is robust, supports numerous languages, and does not require training.
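For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference and a hypothesis, divided by the number of reference words. The sketch below is a generic illustration of that definition, not the evaluation code used in the study; the lowercasing and whitespace tokenization are simplifying assumptions, and each benchmark applies its own text normalization.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace as a minimal normalization.
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "hello darkness my old friend" with the hypothesis "hello darkness my friend" gives one deletion over five reference words, a WER of 0.2. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.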

Significance of the results

The present study offers the following contributions:

  • The proposed method demonstrates remarkable performance in reducing WER across different ALT benchmark datasets such as MUSDB18, Hansen, and Jamendo. Furthermore, this approach achieves comparable results to the existing literature for the in-domain task of singing voice transcription (DSing).
  • The work presents the first ALT system capable of long-form, multilingual, zero-shot transcription. This is made possible by integrating a robust speech transcription model with a large language model.
  • This study has led to the development of a comprehensive dataset of lyrics transcriptions that is publicly accessible, multilingual, and of significant scale. This dataset includes a copyright statement that removes the need for additional user review and enables unrestricted public usage.
  • It offers a human-annotated subset that is used to estimate noise levels and to assess the performance of multilingual automatic lyrics transcription (ALT).

Conclusion

The LyricWhiz system presented in this paper is an automatic lyrics transcription system that performs strongly across multiple datasets and music genres. By integrating the Whisper ASR system with the GPT-4 language model, it achieves a substantial decrease in WER for English and transcribes multiple languages effectively. The authors also used LyricWhiz to create a comprehensive lyrics dataset, which is the first of its kind to be publicly available on a large scale and across multiple languages.

Written by

Ashutosh Roy

Ashutosh Roy has an MTech in Control Systems from IIEST Shibpur. He holds a keen interest in the field of smart instrumentation and has actively participated in the International Conferences on Smart Instrumentation. During his academic journey, Ashutosh undertook a significant research project focused on smart nonlinear controller design. His work involved utilizing advanced techniques such as backstepping and adaptive neural networks. By combining these methods, he aimed to develop intelligent control systems capable of efficiently adapting to non-linear dynamics.    

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Roy, Ashutosh. (2023, July 19). LyricWhiz: Unleashing Multilingual ALT with Whisper and ChatGPT. AZoAi. Retrieved on December 26, 2024 from https://www.azoai.com/news/20230706/LyricWhiz-Unleashing-Multilingual-ALT-with-Whisper-and-ChatGPT.aspx.

  • MLA

    Roy, Ashutosh. "LyricWhiz: Unleashing Multilingual ALT with Whisper and ChatGPT". AZoAi. 26 December 2024. <https://www.azoai.com/news/20230706/LyricWhiz-Unleashing-Multilingual-ALT-with-Whisper-and-ChatGPT.aspx>.

  • Chicago

    Roy, Ashutosh. "LyricWhiz: Unleashing Multilingual ALT with Whisper and ChatGPT". AZoAi. https://www.azoai.com/news/20230706/LyricWhiz-Unleashing-Multilingual-ALT-with-Whisper-and-ChatGPT.aspx. (accessed December 26, 2024).

  • Harvard

    Roy, Ashutosh. 2023. LyricWhiz: Unleashing Multilingual ALT with Whisper and ChatGPT. AZoAi, viewed 26 December 2024, https://www.azoai.com/news/20230706/LyricWhiz-Unleashing-Multilingual-ALT-with-Whisper-and-ChatGPT.aspx.
