AudioSeal: Detecting AI-Generated Speech with Precision

In an article recently posted to the Meta Research website, researchers proposed an innovative audio watermarking technique called "AudioSeal" to detect and localize artificial intelligence (AI)-generated speech quickly and robustly. The aim was to address the challenge of voice cloning, which poses a significant threat to audio authenticity and security due to its potential use in generating fake or misleading audio content.

Study: AudioSeal: Detecting AI-Generated Speech with Precision. Image Credit: Linaimages/Shutterstock

Background

Voice cloning technology creates synthetic speech that closely mimics a target speaker's voice, with applications in voice assistants, audiobooks, dubbing, and voice conversion. Audio watermarking involves embedding a hidden signal in an audio file to verify its source, ownership, or integrity. It can be categorized into multi-bit and zero-bit watermarking. Multi-bit watermarking encodes a binary message into the audio, linking the content to a specific user or generative model, while zero-bit watermarking simply detects the presence or absence of a watermarking signal, useful for identifying AI-generated content.
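To make the distinction concrete, here is a minimal sketch in Python/PyTorch of the two decision types, with random tensors standing in for real detector outputs; the function names, the 16-bit payload, and the 0.5 thresholds are illustrative choices, not any specific system's API:

```python
import torch

def zero_bit_decision(scores: torch.Tensor, threshold: float = 0.5) -> bool:
    """Zero-bit watermarking: decide only whether a watermark is present.
    `scores` holds a per-sample watermark probability in [0, 1]."""
    return scores.mean().item() > threshold

def multi_bit_decode(soft_bits: torch.Tensor) -> str:
    """Multi-bit watermarking: additionally recover a hidden binary message,
    e.g. one identifying the generative model or API user."""
    return "".join("1" if b > 0.5 else "0" for b in soft_bits)

# Stand-in outputs for a one-second clip at 16 kHz with a 16-bit payload
scores = torch.rand(16000)   # hypothetical per-sample detection scores
soft_bits = torch.rand(16)   # hypothetical soft message bits
print(zero_bit_decision(scores), multi_bit_decode(soft_bits))
```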

Most existing audio watermarking methods are based on data hiding: they embed the watermark across the entire audio file and require a synchronization mechanism to extract it. This makes them vulnerable to temporal edits and inefficient for large-scale, real-time applications. Additionally, they are not designed for localized detection and cannot identify small AI-generated segments within longer audio clips.
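The sketch below (with a hypothetical helper) illustrates why such schemes scale poorly: to localize a watermark, a whole-window detector must be rerun at every candidate offset.

```python
def brute_force_detect(wav, window_detector, win=16000, hop=4000):
    """Sliding-window detection as required by synchronization-based schemes.
    `window_detector` is a hypothetical scorer that only judges a whole,
    fixed-size window, so localizing a watermark within a long clip means
    rerunning it at every offset."""
    scores = []
    for start in range(0, len(wav) - win + 1, hop):
        scores.append(window_detector(wav[start:start + win]))
    return scores  # one coarse score per window, not per sample
```

A detector that instead emits a score per sample in a single pass, as described below, avoids this rescanning entirely.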

About the Research

In this paper, the authors designed and developed "AudioSeal," the first audio watermarking methodology tailored for localized detection of AI-generated audio content. They employed a generator/detector architecture trained jointly with a localization loss to enable watermark detection down to the sample level. They introduced a novel perceptual loss inspired by auditory masking, enhancing the watermark's imperceptibility. AudioSeal supports multi-bit watermarking, allowing audio to be attributed to a specific model or version without compromising the detection signal.

The generator in AudioSeal takes an audio waveform as input and produces a watermark waveform of the same dimensionality, which is then added to the original audio to create the watermarked audio. The detector, on the other hand, can take either the original or the watermarked audio as input and output the likelihood of a watermark at each sample of the input audio.
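This input/output contract can be sketched in a few lines of PyTorch. The toy modules below use a single convolution purely to keep the tensor shapes visible; the real AudioSeal networks are much deeper encoder/decoder models, and none of the names or constants here come from the paper:

```python
import torch
import torch.nn as nn

class ToyWatermarkGenerator(nn.Module):
    """Stand-in for the generator: waveform in, watermarked waveform out."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        delta = 1e-3 * torch.tanh(self.net(wav))  # small additive watermark
        return wav + delta                        # watermarked audio

class ToyWatermarkDetector(nn.Module):
    """Stand-in for the detector: one watermark probability per sample."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(wav))       # same shape as the input

wav = torch.randn(1, 1, 16000)                    # (batch, channels, samples)
marked = ToyWatermarkGenerator()(wav)
scores = ToyWatermarkDetector()(marked)
assert scores.shape == wav.shape                  # sample-level localization
```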

This system is trained to accurately and robustly detect watermarked audio embedded within long audio clips by masking the watermark in random parts of the signal. The training focuses on maximizing the detector's accuracy while minimizing the perceptual difference between the original and watermarked audio, ensuring both effective detection and high audio quality.
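A hypothetical training step under that scheme might look as follows, reusing the toy modules above: the watermark is overwritten with clean audio in a random window, the detector is supervised with per-sample labels marking the watermarked region, and a simple L1 distance stands in for the paper's perceptual loss:

```python
import torch
import torch.nn.functional as F

def training_step(gen, det, wav, perceptual_weight=1.0):
    """One illustrative training step; all names and weights are assumptions."""
    marked = gen(wav)

    # Drop the watermark from a random window so the detector must localize
    n = wav.shape[-1]
    start = torch.randint(0, n // 2, (1,)).item()
    end = start + n // 4
    mixed = marked.clone()
    mixed[..., start:end] = wav[..., start:end]

    labels = torch.ones_like(wav)                  # 1 = watermarked sample
    labels[..., start:end] = 0.0                   # 0 = clean (masked) region

    scores = det(mixed)
    detection_loss = F.binary_cross_entropy(scores, labels)
    perceptual_loss = F.l1_loss(marked, wav)       # simplistic stand-in
    return detection_loss + perceptual_weight * perceptual_loss
```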

Research Findings

The novel method's performance was evaluated across various dimensions, including audio quality, detection robustness, localization accuracy, attribution capability, and efficiency. It was compared with the state-of-the-art watermarking method, WavMark, and a passive detection method based on a binary classifier. The outcomes revealed that AudioSeal achieved superior robustness to a wide range of real-life audio manipulations, such as filtering, noise, compression, and resampling.

AudioSeal demonstrated sample-level detection, outperforming WavMark in both speed and accuracy. It successfully detected watermarks in audio streams with just one pass, achieving up to two orders of magnitude faster detection than WavMark, which relied on brute-force detection. In terms of attribution, the new approach accurately identified audio from one model among 1,000, even when the audio had been edited.
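The attribution step can be understood as nearest-neighbor matching over the decoded message bits. The sketch below is a simplified illustration assuming each model is registered with a random 16-bit code; the paper's exact matching procedure may differ:

```python
import torch

torch.manual_seed(0)

def attribute(decoded_bits: torch.Tensor, registry: torch.Tensor) -> int:
    """Return the registered model whose code is closest in Hamming distance.
    decoded_bits: (k,) hard bits read back by the detector
    registry:     (num_models, k) code assigned to each model when signing"""
    hamming = (registry != decoded_bits).sum(dim=1)
    return int(hamming.argmin())

registry = torch.randint(0, 2, (1000, 16))  # 1,000 models, 16-bit codes
decoded = registry[42].clone()
decoded[3] ^= 1                             # one bit corrupted by edits
print(attribute(decoded, registry))         # usually still model 42
```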

The authors assessed the audio quality of the watermarked audio using both objective and subjective metrics, such as the perceptual evaluation of speech quality (PESQ), mean opinion score (MOS), and ABX tests. They found that AudioSeal achieved better imperceptibility than WavMark, with the watermark being undetectable by human listeners. Additionally, they introduced a novel time-frequency loudness loss, which enhanced the perceptual quality of the watermarked audio by leveraging the auditory masking property of the human ear.
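While the exact formulation is more involved, the masking intuition can be approximated as: penalize watermark energy most in time-frequency bins where the original signal is quiet, since quiet regions offer little masking. The sketch below is a drastically simplified stand-in for the paper's loss, not the authors' implementation:

```python
import torch

def tf_masking_loss(wav: torch.Tensor, watermark: torch.Tensor,
                    n_fft: int = 512, hop: int = 128) -> torch.Tensor:
    """Weight watermark magnitude by the inverse magnitude of the host
    signal per time-frequency bin. Inputs have shape (batch, samples);
    all constants here are illustrative."""
    win = torch.hann_window(n_fft)
    host = torch.stft(wav, n_fft, hop, window=win, return_complex=True).abs()
    mark = torch.stft(watermark, n_fft, hop, window=win,
                      return_complex=True).abs()
    weight = 1.0 / (host + 1e-4)   # quiet host bins get large penalties
    return (weight * mark).mean()
```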

Applications

The developed technique can detect, localize, and attribute AI-generated speech, enabling traceability and transparency of synthetic content. It can watermark speech samples generated by various models, such as WaveNet, Tacotron 2, and SeamlessExpressive. Additionally, it generalizes across domains and languages, covering music, environmental sounds, and speech in Mandarin Chinese, French, Italian, and Spanish.

Conclusion

In summary, the novel approach proved to be an effective audio watermarking technique for addressing the challenge of audio authenticity and security in the era of voice cloning. The researchers discussed the security and integrity of audio watermarking techniques when open-sourcing, suggesting that the detector's weights be kept confidential to prevent adversarial attacks. They envisioned AudioSeal as a ready-to-deploy solution for watermarking in voice synthesis APIs, enabling large-scale content provenance on social media and helping prevent the spread of fake or misleading audio content.


Written by

Muhammad Osama

Muhammad Osama is a full-time data analytics consultant and freelance technical writer based in Delhi, India. He specializes in transforming complex technical concepts into accessible content. He has a Bachelor of Technology in Mechanical Engineering with specialization in AI & Robotics from Galgotias University, India, and he has extensive experience in technical content writing, data science and analytics, and artificial intelligence.

