In an article recently posted to the Meta Research website, researchers investigated whether texts generated by large language models (LLMs) retain detectable traces when used as training data. They found that conventional methods such as membership inference can perform this detection with some accuracy. However, they showed that watermarking the training data leaves traces that are easier to detect and far more reliable than membership inference.
The researchers linked the contamination level to factors such as the watermark's robustness, its proportion in the training set, and the fine-tuning process. Even with as little as 5% of the training text containing watermarks, they demonstrated high-confidence detection of training on watermarked synthetic instructions (p-value < 10^-5). Using LLM watermarking, originally designed to detect machine-generated text, the researchers could quickly identify whether the outputs of a watermarked LLM had been used to fine-tune another LLM.
Related Work
Past studies have shown that fine-tuning LLMs with human prompts can be costly and challenging due to the need for expert knowledge and manual annotations. To alleviate these issues, practitioners sometimes train on synthetic data generated by already instruction-tuned models such as Bard, ChatGPT (Chat Generative Pre-trained Transformer), or Claude. However, this raises questions about the originality of the fine-tuned model.
Detecting synthetic text has become increasingly difficult, especially given its potential malicious uses. Watermarking has emerged as a solution for identifying the generating model, particularly in the context of LLMs, with recent techniques ensuring minimal impact on output quality.
Exploring LM Radioactivity Detection
The detection of radioactivity in LMs, both with and without watermarking, is explored across different settings. In the absence of watermarking, methods such as membership inference attacks (MIA) evaluate the radioactivity of individual samples by analyzing the language model's loss on selected inputs. However, these methods are limited when the specific training samples are unknown or access to the fine-tuned model is restricted.
Watermarking offers a solution by embedding a trace in the generated text, enabling detection even when the specific training samples are unidentified or access to the fine-tuned model is limited. The researchers conduct detection tests on the fine-tuning data in both open-model and closed-model settings. In the open-model setting, where the fine-tuned model is fully accessible, watermark detection tests yield significant results even with minimal proportions of watermarked data in the training set.
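As a rough illustration of the loss-based idea behind MIA, the hedged Python sketch below scores candidate texts by the suspect model's average next-token loss, with unusually low loss hinting at memorization. The model name, candidate texts, and threshold are illustrative placeholders, not the attack used in the study.

```python
# Minimal sketch of a loss-based membership inference check, assuming access
# to the fine-tuned model's logits. Model name, samples, and threshold are
# illustrative placeholders only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the suspect fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def sample_loss(text: str) -> float:
    """Average next-token cross-entropy of the model on one candidate sample."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return out.loss.item()

candidates = [
    "A candidate text suspected of being in the fine-tuning set.",
    "Another text that was almost certainly never seen in training.",
]
threshold = 3.0  # arbitrary cut-off: unusually low loss suggests memorization
for text in candidates:
    loss = sample_loss(text)
    verdict = "possible member" if loss < threshold else "likely non-member"
    print(f"loss={loss:.2f} -> {verdict}")
```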
Conversely, in the closed-model setting, where the fine-tuned model is only accessible through an application programming interface (API), watermark detection tests still exhibit effectiveness, particularly when combined with filtering techniques to focus scoring on relevant k-grams. These methods demonstrate robustness in detecting radioactivity, even in scenarios with limited access to the fine-tuned model or uncertainty about specific training samples.
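To make the scoring idea concrete, the following hedged sketch implements a green-list style test: each token is checked against a pseudorandom "green" subset seeded by the preceding k-gram, and a binomial test converts the green-token count into a p-value. The hash function, window size, green fraction, and optional k-gram filter are illustrative choices, not the exact scheme or parameters used in the study.

```python
# Hedged sketch of k-gram watermark scoring: hash the preceding window of
# token ids together with the current token, count how often tokens fall in
# the pseudorandom "green" set, and compute a p-value under the null
# hypothesis of no watermark (green fraction GAMMA). Parameters are
# illustrative only.
import hashlib
from scipy.stats import binomtest

WINDOW = 2    # watermark window size k (smaller windows gave higher confidence)
GAMMA = 0.25  # expected fraction of "green" tokens when no watermark is present

def is_green(window_ids, token_id) -> bool:
    payload = ",".join(map(str, window_ids)).encode() + b"|" + str(token_id).encode()
    digest = hashlib.sha256(payload).hexdigest()
    return int(digest, 16) % 10_000 < GAMMA * 10_000

def watermark_pvalue(token_ids, score_only=None) -> float:
    """Score a token sequence; `score_only` optionally restricts which k-grams count."""
    greens, total = 0, 0
    for i in range(WINDOW, len(token_ids)):
        window = tuple(token_ids[i - WINDOW:i])
        if score_only is not None and window not in score_only:
            continue  # closed-model setting: focus scoring on relevant k-grams
        total += 1
        greens += is_green(window, token_ids[i])
    if total == 0:
        return 1.0
    return binomtest(greens, total, GAMMA, alternative="greater").pvalue

# Example: score a toy sequence of token ids produced by the suspect model
print(watermark_pvalue([5, 17, 902, 45, 33, 77, 12, 8, 301, 42]))
```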
Experimental results confirm the efficacy of watermark-based detection in identifying radioactivity in fine-tuned LMs across various scenarios. While MIA provides robust detection in settings with open access to the fine-tuned model and known training data, watermark-based detection proves effective even in scenarios with restricted access or unknown training samples. This highlights the potential of watermarking as a reliable method for detecting the use of specific LM outputs in fine-tuning, contributing to understanding and mitigating the potential risks associated with model contamination.
Illuminating Key Factors
Having demonstrated, with high confidence, robust detection of watermark traces in a practical scenario, the researchers conduct further investigations to understand the factors influencing radioactivity in LMs. This analysis delves into three key aspects: fine-tuning, the watermarking algorithm, and data distribution.
Starting with fine-tuning, the researchers analyze its impact on radioactivity detection using the same setup. They examine variables such as learning rate, fine-tuning algorithm, number of epochs, and model size. Results indicate that the more closely the model fits the fine-tuning data, the easier the radioactivity is to detect; for instance, increasing the learning rate can significantly affect how strongly radioactivity is detected.
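As a simple illustration of the knobs being varied, the sketch below sets up a hypothetical fine-tuning configuration with the Hugging Face Trainer arguments; the values and output directory are placeholders, not the study's actual hyperparameters.

```python
# Hypothetical fine-tuning configuration illustrating the hyperparameters the
# analysis varies (learning rate, number of epochs); values are placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="finetuned-B",        # where checkpoints of the suspect model B would go
    learning_rate=1e-5,              # higher values make the model fit the data more tightly
    num_train_epochs=3,              # more epochs likewise strengthen memorization
    per_device_train_batch_size=8,
)
print(args.learning_rate, args.num_train_epochs)
```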
In the subsequent analysis, the researchers explore the influence of the watermarking method and data distribution on radioactivity detection. They introduce more diversity into the data by prompting LM A with the beginnings of English Wikipedia articles and generating the continuation tokens with and without watermarking. They then fine-tune LM B on these prompts and answers.
One aspect investigated is the watermark window size, with findings suggesting that smaller window sizes lead to higher confidence in radioactivity detection. Additionally, researchers consider the impact of data distribution, particularly in the unsupervised setting where Alice lacks prior knowledge about the distribution of the data used to fine-tune B. By running detection tests on text generated in different languages, researchers demonstrate the importance of data distribution in influencing radioactivity detection.
Even when Alice has little knowledge of the specific data distributions used to train B, the researchers propose combining p-values using Fisher's method. This approach allows Alice to test for radioactivity across various distributions and combine the significance levels, providing insight into potential model contamination across different data sources and languages. Overall, these investigations shed light on the complex interplay between fine-tuning, watermarking methods, and data distribution in radioactivity detection in LMs.
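The combination step itself can be sketched in a few lines with SciPy's Fisher combination of independent p-values; the per-language p-values below are invented purely for illustration.

```python
# Hedged sketch: combine per-distribution (e.g., per-language) radioactivity
# p-values with Fisher's method. The individual p-values are made up for
# illustration only.
from scipy.stats import combine_pvalues

per_language_pvalues = {
    "english": 1e-4,
    "french": 0.03,
    "german": 0.40,  # a distribution B may not have been fine-tuned on
}

statistic, combined_p = combine_pvalues(list(per_language_pvalues.values()), method="fisher")
print(f"Fisher chi2 statistic = {statistic:.2f}, combined p-value = {combined_p:.2e}")
```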
Conclusion
Summing up, this study introduced methods to detect traces of "radioactivity" in LLMs, that is, whether their generated texts were used as training data for other models. Non-watermarked texts proved difficult to detect, particularly in realistic scenarios. Watermarked texts, however, showed significant radioactivity, contaminating models during fine-tuning. This provided high-confidence identification of whether the outputs of a watermarked model had been used to fine-tune another model, although the approach detects the use of the model's outputs rather than of the model itself.