In an article recently posted to the Meta Research website, researchers proposed an innovative audio watermarking technique called "AudioSeal" to detect and localize artificial intelligence (AI)-generated speech quickly and robustly. The aim was to address the challenge of voice cloning, which poses a significant threat to audio authenticity and security due to its potential use in generating fake or misleading audio content.
Background
Voice cloning technology creates synthetic speech that closely mimics a target speaker's voice, with applications in voice assistants, audiobooks, dubbing, and voice conversion. Audio watermarking involves embedding a hidden signal in an audio file to verify its source, ownership, or integrity. It can be categorized into multi-bit and zero-bit watermarking. Multi-bit watermarking encodes a binary message into the audio, linking the content to a specific user or generative model, while zero-bit watermarking simply detects the presence or absence of a watermarking signal, useful for identifying AI-generated content.
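To make the distinction concrete, here is a minimal Python sketch of the two detection interfaces. The function names, the detector/decoder callables, and the 0.5 threshold are illustrative assumptions, not part of AudioSeal.

```python
import numpy as np

def zero_bit_detect(audio: np.ndarray, detector) -> bool:
    """Zero-bit watermarking: report only whether a watermark is present.
    `detector` is assumed to return a presence score in [0, 1] for the clip."""
    return detector(audio) > 0.5  # threshold chosen purely for illustration

def multi_bit_decode(audio: np.ndarray, decoder, n_bits: int = 16) -> np.ndarray:
    """Multi-bit watermarking: additionally recover an embedded binary message,
    e.g. an identifier tying the clip to a specific user or generative model.
    `decoder` is assumed to return per-bit probabilities of length `n_bits`."""
    bit_probs = np.asarray(decoder(audio))
    return (bit_probs[:n_bits] > 0.5).astype(np.uint8)
```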
Most existing audio watermarking methods are based on data hiding: they embed the watermark across the entire audio file and require a synchronization mechanism to extract it. This makes them vulnerable to temporal edits and inefficient for large-scale, real-time applications. They are also not designed for localized detection and cannot identify short AI-generated segments within longer audio clips.
About the Research
In this paper, the authors designed and developed "AudioSeal," the first audio watermarking methodology tailored for localized detection of AI-generated audio content. They employed a generator/detector architecture trained jointly with a localization loss to enable watermark detection down to the sample level. They introduced a novel perceptual loss inspired by auditory masking, enhancing the watermark's imperceptibility. AudioSeal supports multi-bit watermarking, allowing audio to be attributed to a specific model or version without compromising the detection signal.
The generator in AudioSeal takes an audio waveform as input and produces a watermark waveform of the same dimensionality, which is added to the original audio to create the watermarked audio. The detector takes either the original or the watermarked audio as input and outputs, for every sample, the probability that a watermark is present.
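This generator/detector interface can be summarized in a simplified Python sketch. The tiny convolutional networks, amplitude scaling, and tensor shapes below are illustrative placeholders rather than the architecture used in the paper; only the overall flow (additive watermark, per-sample detection probabilities) mirrors the description above.

```python
import torch
import torch.nn as nn

class WatermarkGenerator(nn.Module):
    """Toy stand-in for the generator: maps a waveform to an additive
    watermark of the same shape. The 1-D conv stack is illustrative only."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, 1, samples) -> watermark of identical shape
        return 1e-3 * self.net(audio)  # small amplitude keeps it inaudible

class WatermarkDetector(nn.Module):
    """Toy stand-in for the detector: per-sample watermark probability."""
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=7, padding=3),
        )

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(audio))  # (batch, 1, samples) in [0, 1]

audio = torch.randn(1, 1, 16000)           # one second at 16 kHz
generator, detector = WatermarkGenerator(), WatermarkDetector()
watermarked = audio + generator(audio)      # additive watermark
per_sample_prob = detector(watermarked)     # localization down to the sample
```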
The system is trained to detect watermarked segments embedded within long audio clips accurately and robustly: during training, the watermark is dropped from random portions of the signal so the detector learns to localize it at the sample level. The objective jointly maximizes the detector's accuracy and minimizes the perceptual difference between the original and watermarked audio, yielding both reliable detection and high audio quality.
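A rough sketch of what such a training step could look like is shown below, with the generator and detector (for example, the toy modules from the previous sketch) passed in as arguments. The random-drop masking, loss weights, and the simple L1 imperceptibility term are placeholders for illustration; the paper's actual objective uses a perceptual loudness-based loss.

```python
import torch
import torch.nn.functional as F

def training_step(audio, generator, detector, optimizer, drop_prob=0.5):
    """One illustrative joint training step: watermark the clip, drop the
    watermark from a random region, and train the detector to label every
    sample correctly while keeping the watermark perceptually small."""
    batch, _, n = audio.shape
    watermark = generator(audio)

    # Per-sample labels: 1 where the watermark is kept, 0 where it is dropped.
    labels = torch.ones(batch, 1, n)
    if torch.rand(()) < drop_prob:
        start = int(torch.randint(0, n // 2, (1,)).item())
        end = start + n // 4
        watermark = watermark.clone()
        watermark[:, :, start:end] = 0.0
        labels[:, :, start:end] = 0.0

    watermarked = audio + watermark
    pred = detector(watermarked)                     # per-sample probabilities

    loc_loss = F.binary_cross_entropy(pred, labels)  # sample-level localization
    percept_loss = watermark.abs().mean()            # crude imperceptibility proxy
    loss = loc_loss + 10.0 * percept_loss            # weight chosen arbitrarily

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, a single optimizer spanning both modules (for example, torch.optim.Adam over the concatenated parameter lists of the generator and detector) would be stepped repeatedly over batches of speech.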
Research Findings
The novel method's performance was evaluated across various dimensions, including audio quality, detection robustness, localization accuracy, attribution capability, and efficiency. It was compared with the state-of-the-art watermarking method, WavMark, and a passive detection method based on a binary classifier. The outcomes revealed that AudioSeal achieved superior robustness to a wide range of real-life audio manipulations, such as filtering, noise, compression, and resampling.
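A robustness check of this kind can be approximated by applying synthetic manipulations before running the detector. The attacks below (white noise at a target SNR, a crude resampling round-trip, a gain change) are simplified stand-ins for the paper's evaluation pipeline, and the `watermarked` clip and trained `detector` are assumed to come from the earlier sketches.

```python
import torch
import torch.nn.functional as F

def add_noise(x, snr_db=20.0):
    """Additive white noise at a given signal-to-noise ratio (in dB)."""
    noise = torch.randn_like(x)
    scale = x.norm() / (noise.norm() * 10 ** (snr_db / 20))
    return x + scale * noise

def resample_roundtrip(x, factor=2):
    """Crude resampling attack: downsample then upsample with linear interp."""
    n = x.shape[-1]
    down = F.interpolate(x, size=n // factor, mode="linear", align_corners=False)
    return F.interpolate(down, size=n, mode="linear", align_corners=False)

def detection_score(detector, x):
    """Average per-sample watermark probability over the clip."""
    with torch.no_grad():
        return detector(x).mean().item()

attacks = {
    "clean": lambda x: x,
    "noise_20dB": add_noise,
    "resample_x2": resample_roundtrip,
    "gain_0.5x": lambda x: 0.5 * x,
}
# Assuming `watermarked` and `detector` from the earlier sketches:
# for name, attack in attacks.items():
#     print(name, detection_score(detector, attack(watermarked)))
```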
AudioSeal demonstrated sample-level detection, outperforming WavMark in both speed and accuracy. It successfully detected watermarks in audio streams with just one pass, achieving up to two orders of magnitude faster detection than WavMark, which relied on brute-force detection. In terms of attribution, the new approach accurately identified audio from one model among 1,000, even when the audio had been edited.
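Attribution of this kind can be illustrated by matching a decoded multi-bit message against a registry of known model identifiers. The 32-bit random messages and Hamming-distance matching below are an assumed, simplified scheme, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MODELS, N_BITS = 1000, 32  # 32-bit messages chosen here for illustration

# Hypothetical registry: one binary message per model or model version.
registry = rng.integers(0, 2, size=(N_MODELS, N_BITS), dtype=np.uint8)

def attribute(decoded_bits: np.ndarray) -> int:
    """Return the registered model whose message is closest, in Hamming
    distance, to the bits decoded from the audio. Edits that flip a few
    bits still map to the right model as long as it remains the nearest."""
    distances = (registry != decoded_bits).sum(axis=1)
    return int(distances.argmin())

# Example: the message of model 42 with one bit corrupted by audio edits.
decoded = registry[42].copy()
decoded[3] ^= 1
print(attribute(decoded))  # 42, for virtually any random registry of this size
```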
The authors assessed the audio quality of the watermarked audio using both objective and subjective metrics, including the perceptual evaluation of speech quality (PESQ), mean opinion score (MOS), and ABX tests. They found that AudioSeal achieved better imperceptibility than WavMark, with the watermark remaining inaudible to human listeners. The novel time-frequency loudness loss used during training contributed to this perceptual quality by exploiting the auditory masking properties of the human ear.
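As an example of the objective side of such an evaluation, PESQ can be computed with the open-source `pesq` package (an assumption for illustration; it is not part of AudioSeal), comparing the original and watermarked waveforms.

```python
import numpy as np
from pesq import pesq  # open-source PESQ implementation (pip install pesq)

SR = 16000  # PESQ wideband mode expects 16 kHz, single-channel input

def pesq_wideband(reference: np.ndarray, degraded: np.ndarray) -> float:
    """Objective quality of the watermarked signal relative to the original;
    higher scores (up to about 4.5) mean less perceptible degradation."""
    return pesq(SR, reference.astype(np.float32), degraded.astype(np.float32), "wb")

# Illustrative call on 1-D waveforms (e.g. flattened tensors from earlier sketches):
# score = pesq_wideband(original_waveform, watermarked_waveform)
```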
Applications
The developed technique can detect, localize, and attribute AI-generated speech, enabling traceability and transparency for synthetic content. It can watermark speech generated by various models, such as WaveNet, Tacotron 2, and SeamlessExpressive, and it generalizes beyond English speech to other languages, including Mandarin Chinese, French, Italian, and Spanish, as well as to other audio domains, such as music and environmental sounds.
Conclusion
In summary, the novel approach proved to be an effective audio watermarking technique for addressing the challenge of audio authenticity and security in the era of voice cloning. The researchers also discussed the security and integrity implications of open-sourcing watermarking systems, suggesting that the detector's weights be kept confidential to guard against adversarial attacks. They envision AudioSeal as a ready-to-deploy solution for watermarking in voice synthesis APIs, enabling large-scale content provenance on social media and helping to prevent incidents of fake or misleading audio content.