In an article recently submitted to the arXiv* preprint server, researchers introduced AudioSeal, an audio watermarking technique tailored for localized detection of artificial intelligence (AI)-generated speech. They employed a generator/detector architecture and incorporated a novel perceptual loss for enhanced imperceptibility. AudioSeal demonstrated state-of-the-art robustness to audio manipulations, and its single-pass detector significantly outpaced existing models in speed, making it suitable for large-scale and real-time applications.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Generative speech models have reached a point where they can create voices indistinguishable from real ones, raising concerns about misinformation and scams. Current detection methods, which rely on binary classifiers to differentiate natural from synthesized audio, face challenges as generative models improve. Watermarking, which embeds an imperceptible signal in the audio, emerges as an alternative for traceability and content identification.
Traditional detection methods, though accurate initially, struggle to keep pace with advancing generative models. Watermarking methods, while promising, are typically designed for data hiding rather than detection and often lack localization, hindering the identification of short AI-generated segments within longer audio clips. This paper introduced AudioSeal, a novel audio watermarking technique tailored for localized, sample-level detection of AI-generated speech.
By training a generator and a detector jointly, AudioSeal overcame the limitations of existing methods: the generator predicts an imperceptible watermark, and the detector precisely identifies synthesized speech within longer clips. A novel perceptual loss enhances imperceptibility, and the method extends to multi-bit watermarking for model attribution without compromising detection. The contributions include a pioneering joint training approach, state-of-the-art detection and localization results, and substantial computation speed improvements that make AudioSeal well suited to real-time applications.
Methods
The proposed method, AudioSeal, employed a joint training approach for a generator and a detector. The generator added an imperceptible watermark to the input audio, while the detector output local, per-sample detection logits. Training jointly minimized perceptual distortion and maximized watermark detection. Augmentations improved robustness to signal modifications and sharpened localization. At inference, the detector precisely localized watermarked segments, enabling the detection of AI-generated content.
Figure: Proactive detection of AI-generated speech. An imperceptible watermark embedded in the audio can be used to detect whether speech is AI-generated, to identify the model that generated it, and to pinpoint AI-generated segments within a longer clip at sample-level resolution (1/16,000 of a second at the 16 kHz sampling rate).
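To make the joint objective concrete, the following is a minimal PyTorch-style sketch of one training step. The module names, the plain-L1 perceptual proxy, and the loss weighting are illustrative assumptions, not the authors' implementation (the paper balances multi-scale Mel, adversarial, and TF-Loudness losses instead).

```python
import torch
import torch.nn.functional as F

def train_step(generator, detector, audio, optimizer):
    """One joint update; `audio` is a (batch, 1, samples) waveform tensor."""
    delta = generator(audio)            # predicted imperceptible watermark
    watermarked = audio + delta

    # The detector sees both watermarked and clean audio and is trained to
    # output per-sample logits: 1 for watermarked, 0 for clean.
    logits_wm = detector(watermarked)   # assumed shape: (batch, samples)
    logits_clean = detector(audio)
    det_loss = (
        F.binary_cross_entropy_with_logits(logits_wm, torch.ones_like(logits_wm))
        + F.binary_cross_entropy_with_logits(logits_clean, torch.zeros_like(logits_clean))
    )

    # Perceptual proxy: keep the watermark small. The paper uses dedicated
    # perceptual losses rather than this plain L1 penalty.
    percep_loss = delta.abs().mean()

    loss = det_loss + 10.0 * percep_loss   # weighting is a placeholder
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```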
- Training Pipeline: The training process involved watermark generation, sample-level localization augmentation, robustness augmentation, and detector processing. The architecture was based on EnCodec, emphasizing imperceptibility and accurate localization.
- Losses: Several perceptual losses ensured imperceptibility, including a multi-scale Mel spectrogram loss and adversarial losses. A novel time-frequency (TF) loudness loss, TF-Loudness, further improved watermark imperceptibility by exploiting auditory masking. Additionally, a masked, sample-level detection loss ensured precise localization (a sketch of this objective follows the list).
- Multi-bit Watermarking: The method extended to multi-bit watermarking for model attribution. A message-processing layer in the generator and linear layers in the detector supported embedding and extracting binary messages, attributing audio to specific model versions (also sketched after the list).
- Training Details: The models were trained on a subset of the VoxPopuli dataset at a 16 kHz sampling rate. Training ran for 600,000 steps with the Adam optimizer, a learning rate of 0.0001, and a batch size of 32. Augmentations included watermark-drop augmentations for localization and robustness augmentations, and the perceptual, localization, and watermarking losses were balanced with appropriate weights.
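The sample-level localization objective can be sketched as follows: a window of the watermarked clip is replaced by the corresponding clean audio, and the detector is trained against the resulting per-sample labels. The tensor shapes, window-sampling scheme, and detector output format are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def localization_loss(detector, watermarked, clean):
    """Splice a clean segment into the watermarked clip and train the
    detector to label every sample (1 = watermarked, 0 = clean)."""
    b, c, t = watermarked.shape
    mixed = watermarked.clone()
    labels = torch.ones(b, t)

    # Drop the watermark on a random window (illustrative sampling scheme).
    start = torch.randint(0, t // 2, (1,)).item()
    end = min(start + torch.randint(t // 8, t // 2, (1,)).item(), t)
    mixed[..., start:end] = clean[..., start:end]
    labels[:, start:end] = 0.0

    logits = detector(mixed)            # assumed shape: (batch, samples)
    return F.binary_cross_entropy_with_logits(logits, labels)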
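For the multi-bit extension, one plausible reading of the message-processing layer is an embedding table indexed by (bit position, bit value) whose entries are summed into the generator's latent, with the detector's linear head emitting per-sample bit logits that are averaged over time at decoding. The dimensions and 16-bit message size below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MessageEmbedder(nn.Module):
    """Maps a k-bit message to a latent offset added inside the generator."""
    def __init__(self, nbits=16, dim=128):
        super().__init__()
        # One learned embedding per (bit position, bit value) pair.
        self.emb = nn.Embedding(2 * nbits, dim)

    def forward(self, message):  # message: (batch, nbits) tensor in {0, 1}
        idx = 2 * torch.arange(message.shape[1], device=message.device) + message.long()
        return self.emb(idx).sum(dim=1)  # (batch, dim) latent offset

def decode_message(bit_logits):
    """bit_logits: (batch, nbits, samples) from the detector's linear head.
    Average the logits over time, then threshold each bit at zero."""
    return (bit_logits.mean(dim=-1) > 0).long()
```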
Audio/Speech Quality
AudioSeal performed detection, localization, and attribution. Evaluation metrics included objective perceptual measures and a subjective MUSHRA listening test, reflecting AudioSeal's focus on perceptual quality rather than scale-invariant signal-to-noise ratio (SI-SNR) optimization. AudioSeal achieved comparable or better perceptual quality than traditional watermarking methods such as WavMark, showcasing its effectiveness in imperceptible watermarking for speech signals.
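For reference, SI-SNR, the metric the authors deliberately do not optimize for, can be computed as follows. This is the standard formulation, not code from the paper.

```python
import torch

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two mono waveforms (1-D tensors)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to remove any scale difference.
    s_target = (estimate @ reference) / (reference @ reference + eps) * reference
    e_noise = estimate - s_target
    return 10 * torch.log10((s_target.pow(2).sum() + eps) / (e_noise.pow(2).sum() + eps))
```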
Experiments and Evaluation
The detection performance of passive classifiers, watermarking methods, and AudioSeal was assessed, with the true positive rate (TPR) and false positive rate (FPR) as key metrics for watermark detection (a sketch of these metrics, along with the localization IoU, follows the list below). A comparison with the passive classifier from the Voicebox paper showed AudioSeal's superior performance, achieving perfect classification across various scenarios.
- Comparison with Passive Classifier: AudioSeal's proactive detection outperformed the Voicebox classifier in distinguishing original from AI-generated audio across different masking levels. Moreover, on the harder task of distinguishing Voicebox outputs from re-synthesized audio, where classifier-specific artifacts are removed, AudioSeal maintained perfect detection, underscoring its robustness.
- Comparison with Watermarking Methods: Evaluation against robustness and audio-editing attacks, including time modification, filtering, audio effects, noise, and compression, demonstrated AudioSeal's superior robustness compared to WavMark. AudioSeal's average area under the curve (AUC) was 0.97, outperforming WavMark's 0.84 and indicating its effectiveness across diverse scenarios.
- Localization and Attribution: AudioSeal excelled in localization accuracy, achieving a high Intersection over Union (IoU) even when only a small portion of a clip was watermarked. In attribution tasks, AudioSeal's accuracy surpassed WavMark's, demonstrating its ability to effectively identify the model version that generated the audio.
- Efficiency Analysis: AudioSeal significantly outperformed WavMark in efficiency, running 14 times faster in generation and two orders of magnitude faster in detection. This efficiency, combined with accurate detection and attribution, positioned AudioSeal as a viable solution for real-time and large-scale applications.
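To make the reported metrics concrete, the following sketch shows how clip-level TPR/FPR, a threshold-free AUC, and the sample-level localization IoU can be computed from detector outputs. The score aggregation (mean per-sample probability) and the 0.5 threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def tpr_fpr(scores_wm, scores_clean, threshold=0.5):
    """Clip-level TPR/FPR from per-clip scores (e.g., the mean of the
    detector's per-sample probabilities for each clip)."""
    tpr = float(np.mean(np.asarray(scores_wm) > threshold))
    fpr = float(np.mean(np.asarray(scores_clean) > threshold))
    return tpr, fpr

def detection_auc(scores_wm, scores_clean):
    """Threshold-free AUC over watermarked vs. clean score populations."""
    labels = np.concatenate([np.ones(len(scores_wm)), np.zeros(len(scores_clean))])
    scores = np.concatenate([scores_wm, scores_clean])
    return roc_auc_score(labels, scores)

def localization_iou(pred_mask, true_mask):
    """Sample-level IoU between predicted and ground-truth watermark masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    union = np.logical_or(pred, true).sum()
    return float(np.logical_and(pred, true).sum() / union) if union else 1.0
```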
Conclusion
In conclusion, AudioSeal represented a novel approach to audio watermarking, focusing on localized detection rather than traditional data hiding. By employing a generator/detector architecture, it achieved state-of-the-art robustness against audio editing techniques, precise localization, and significantly faster runtime.
The method was ready for practical deployment in voice-synthesis application programming interfaces (APIs), addressing content-provenance concerns. However, ethical considerations arose regarding potential misuse, emphasizing the need for robust security measures and legal frameworks to govern the technology's application, even as it plays an essential role in detecting AI-generated content for transparency and traceability.
Journal reference:
- Preliminary scientific report.
Roman, R. S., Fernandez, P., Défossez, A., Furon, T., Tran, T., & Elsahar, H. (2024, January 30). Proactive Detection of Voice Cloning with Localized Watermarking. arXiv. https://arxiv.org/abs/2401.17264
Article Revisions
- Feb 5 2024 - Inclusion of a figure from the research paper