In an article recently submitted to the arXiv* server, researchers explored watermarking to differentiate generated text from natural text. They introduced new statistical tests with robust theoretical guarantees, even at very low false-positive rates, compared watermark effectiveness on classical natural language processing (NLP) benchmarks, and developed advanced detection schemes for scenarios with access to the large language model (LLM), including multi-bit watermarking.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
Past work has highlighted the potential misuse of LLMs for generating disinformation, impersonating individuals, and enabling academic dishonesty. State-of-the-art methods propose watermarking to distinguish generated text from real text. Building on this line of work, the present study introduces new statistical tests with robust guarantees against false positives, compares the effectiveness of watermarks using traditional NLP benchmarks, and develops advanced detection schemes and multi-bit watermarking techniques to enhance identification accuracy and trace specific LLM versions.
Advanced Watermarking Techniques
LLMs generate text by estimating the likelihood of token sequences, using various sampling methods for generation. Watermarking techniques modify the token distribution or the sampling process to embed invisible traces in the text. These methods involve altering token probabilities or using deterministic sampling based on a secret key, which helps detect whether text is watermarked.
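To make this concrete, below is a minimal sketch in the spirit of the greenlist-style watermarks the paper analyzes: a secret key and the last h tokens seed a pseudo-random partition of the vocabulary, and a bias delta is added to the logits of the "green" tokens before sampling. The function names and parameter values (`greenlist_bias`, `gamma`, `delta`) are illustrative, not taken from the paper.

```python
import hashlib
import torch

def seed_from_context(secret_key: int, context: list[int], h: int) -> int:
    """Derive an RNG seed from the secret key and the last h tokens."""
    payload = str((secret_key, tuple(context[-h:]))).encode()
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

def greenlist_bias(logits: torch.Tensor, context: list[int],
                   secret_key: int = 42, h: int = 2,
                   gamma: float = 0.5, delta: float = 2.0) -> torch.Tensor:
    """Add a bias `delta` to a pseudo-random fraction `gamma` of the vocabulary.

    The "green" subset is determined by an RNG seeded from the secret key
    and the watermark context window (the last h tokens).
    """
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(seed_from_context(secret_key, context, h))
    perm = torch.randperm(vocab_size, generator=gen)
    green = perm[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green] += delta
    return biased
```

Seeding the partition from the secret key and the context window is what makes detection possible without rerunning the model: a detector holding the key can replay the same partition at every position of a given text.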
The detection process involves statistical tests like Z-tests to differentiate between natural and watermarked text, with adjustments for quality and robustness. Key management ensures diversity and synchronization using cryptographic functions. Recent improvements refine statistical methods and scoring strategies to address false positive rates (FPR) and enhance detection accuracy.
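For the greenlist scheme, the classical test works roughly as follows: under the null hypothesis that the text is natural, each scored token falls in the green list with probability gamma, so the green-token count can be standardized and compared against a normal tail. A minimal sketch of this asymptotic test, with illustrative function names:

```python
import math

def zscore(green_count: int, total: int, gamma: float = 0.5) -> float:
    """Z-statistic for the number of green tokens among `total` scored tokens.

    Under H0 (unwatermarked text), green_count ~ Binomial(total, gamma),
    so the standardized count is approximately standard normal.
    """
    mean = gamma * total
    std = math.sqrt(gamma * (1 - gamma) * total)
    return (green_count - mean) / std

def p_value_asymptotic(z: float) -> float:
    """One-sided p-value from the normal approximation."""
    return 0.5 * math.erfc(z / math.sqrt(2))
```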
Challenges in Watermark Detection
Large-scale evaluations reveal a gap between theoretical and practical FPR in watermark detection. Selecting 100k texts from multilingual Wikipedia and running detection tests with varying window lengths (h) for the random number generator (RNG) seeding, the researchers observed that empirical FPRs were much higher than theoretical ones. The larger the watermarking context window, the closer the results aligned with the theoretical guarantees. However, achieving reliable p-values requires a significantly large h, which compromises the robustness of the watermarking method against text editing.
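A sketch of how such an empirical check can be run, assuming a `detect` function that maps a text to a p-value; the interface is hypothetical:

```python
def empirical_fpr(natural_texts: list[str], detect, threshold: float = 1e-3) -> float:
    """Fraction of unwatermarked texts flagged at a given p-value threshold.

    With a statistically sound test, this fraction should stay at or
    below `threshold`; the paper reports it was much higher in practice.
    """
    return sum(detect(text) < threshold for text in natural_texts) / len(natural_texts)
```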
The researchers developed new non-asymptotic tests to address the limitations of Z-tests for short or repetitive texts. For the greenlist watermark method, the score follows a binomial distribution under the null hypothesis, and p-values are computed using the regularized incomplete Beta function. For the deterministic sampling method, the score follows a gamma distribution, and p-values are calculated using the upper incomplete gamma function. These new tests significantly reduce the gap between empirical and theoretical FPR, particularly at low FPR values.
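Assuming the score definitions just described (a binomial green-token count for the greenlist method, and a sum of exponential terms for the deterministic sampling method), the exact p-values are straightforward to compute with standard special functions, for example via SciPy; the function names here are illustrative:

```python
from scipy.special import betainc, gammaincc

def pvalue_greenlist(score: int, n: int, gamma: float = 0.5) -> float:
    """P(Binomial(n, gamma) >= score), via the identity
    P(S >= s) = I_gamma(s, n - s + 1), where I is the regularized
    incomplete Beta function."""
    if score <= 0:
        return 1.0
    return betainc(score, n - score + 1, gamma)

def pvalue_sampling(score: float, n: int) -> float:
    """P(Gamma(n, 1) >= score), via the regularized upper incomplete
    gamma function Q(n, score)."""
    return gammaincc(n, score)
```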
Even with the improved statistical tests, empirical FPRs remained higher than theoretical ones because the underlying random variables are only pseudo-random, an effect most visible in formatted data with repeated sequences. Two heuristics were tested to mitigate this issue: scoring tokens only if the watermark context window had not been seen before, and scoring tokens only if the (h + 1)-tuple formed by the watermark context and the current token had not been seen before. The latter proved more effective, ensuring that empirical and theoretical FPRs matched perfectly, except for h = 0. Together, the new statistical tests and the scoring of unique token sequences guarantee the FPR.
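A sketch of the second, more effective heuristic: only positions whose (h + 1)-tuple of context plus current token is new contribute to the score. The function name is illustrative:

```python
def scored_positions(tokens: list[int], h: int) -> list[int]:
    """Indices of tokens to score: only where the (h + 1)-tuple formed by the
    watermark context window and the current token has not been seen before.

    This de-duplication keeps repeated sequences (common in formatted text)
    from being scored multiple times and inflating the score.
    """
    seen: set[tuple[int, ...]] = set()
    keep = []
    for i in range(h, len(tokens)):
        tup = tuple(tokens[i - h : i + 1])
        if tup not in seen:
            seen.add(tup)
            keep.append(i)
    return keep
```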
Watermarking Evaluation Overview
This section introduces the evaluation of watermarking methods using the revised statistical tests and explores their impact on natural language processing benchmarks. The focus is on assessing how effectively these methods detect watermarked text, employing stringent detection thresholds to ensure a low FPR. Evaluations are conducted in a simulated chatbot scenario using LLaMA (Large Language Model Meta AI) models, examining different watermark strengths and simulating attacks such as token replacements.
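As a rough illustration of the robustness evaluation, a token-replacement attack can be simulated by randomly substituting a fraction of tokens before running detection. The helper below is a hypothetical sketch, not the paper's exact procedure:

```python
import random

def replace_tokens(tokens: list[int], fraction: float, vocab_size: int,
                   rng: random.Random = random.Random(0)) -> list[int]:
    """Simulate an editing attack: replace a random fraction of tokens
    with random vocabulary items, then re-run watermark detection."""
    out = list(tokens)
    n_replace = int(fraction * len(out))
    for i in rng.sample(range(len(out)), n_replace):
        out[i] = rng.randrange(vocab_size)
    return out
```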
Results highlight varying levels of success in achieving true positive rates (TPR) across the different methods, alongside considerations of semantic distortion measured by Sentence-BERT (S-BERT) scores. Furthermore, the analysis investigates how the watermark context width (h) influences token repetition and overall detection sensitivity, which is crucial for balancing robustness and accuracy in real-world applications.
Watermarking's impact on free-form generation tasks is then assessed across several key natural language processing benchmarks. Unlike traditional quality metrics such as perplexity or similarity scores, which may overlook subtle errors introduced by watermarking, this evaluation directly measures performance in tasks like closed-book question answering, mathematical reasoning, and code generation. Larger models demonstrate greater resilience, suggesting that their advanced generative capabilities help mitigate the negative effects of watermarking on practical applications.
Conclusion
To sum up, this research provided theoretical and empirical insights previously overlooked in the literature on watermarks for LLMs. Existing methods were found to rely on biased statistical tests, resulting in inaccurate false positive rates. The researchers addressed this by introducing grounded statistical tests and a revised scoring strategy, and they also proposed evaluation setups and detection schemes that strengthen the application of watermarks for LLMs.
Future work may explore adapting watermarks to more complex sampling schemes, such as beam search, which have significantly improved generation quality. Despite being relatively new in the context of generative models, watermarking has proven reliable and practical for identifying and tracing LLM outputs.
Journal reference:
- Preliminary scientific report.
Fernandez, P., et al. (2023). Three Bricks to Consolidate Watermarks for Large Language Models. arXiv. DOI: 10.48550/arXiv.2308.00113, https://arxiv.org/abs/2308.00113