In a paper published in the journal Electronics, researchers explored a novel approach to data-stealing attacks by introducing an adaptive method to extract private training data from pre-trained large language models (LLMs) via backdooring. Their method focused on model customization and was conducted in two phases: backdoor training and activation.
During the customization stage, attackers injected the backdoor into the pre-trained LLM by poisoning a small portion of the training dataset. In the inference stage, attackers extracted private information from the third-party knowledge database using a pre-defined backdoor trigger. The researchers demonstrated the effectiveness of their attack through extensive experiments, achieving a notable success rate and maintaining stealthiness during normal inference.
Background
Past work has shown that LLMs excel in natural language processing tasks but are vulnerable to various security threats, including stealing and backdoor attacks. Researchers have explored stealing attacks that extract sensitive data from models and backdoor attacks that insert triggers into training data, allowing attackers to retrieve private information later. Methods like model construction side-channel (MosConS) and defenses like Prada have been proposed to address these threats.
Stealthy Data Extraction
This chapter outlines the approach for stealing attacks, followed by a detailed explanation of backdoor injection into victim models. Finally, the team demonstrates how attackers can steal private information during training (customization) and inference stages.
The attacker knows the pre-defined trigger that activates the backdoor and can only interact with the victim models via application programming interface (API) access, reflecting a practical, real-world scenario. The trigger design must be unique to ensure the effective extraction of private information.
The attack unfolds in various scenarios, emphasizing different phases of adversary operations. Initially, during model training, The attacker implants a backdoor into the large language model (LLM) during its customization process by third-party platforms. Once integrated, this backdoor allows the attacker to exploit the model to access private information or perform malicious activities.
Mitigating backdoor attacks is challenging due to the backdoor's subtlety, requiring specialized detection tools and thorough monitoring. Implementing preventive measures throughout the model's lifecycle demands deep knowledge of the LLM architecture and ongoing monitoring for unusual behavior. As adversaries evolve, security controls must adapt to counter new methods.
Inspired by traditional backdoor attacks in image classification and sentiment analysis, this approach assumes the attacker controls the foundation model. A stealthy backdoor injection method is defined to steal private information during model customization.
The objective is twofold: maintain benign functionality and ensure backdoor effectiveness. During model customization, a small ratio of poisoning data is added to the training sets. After backdoor training, the model is made publicly available, allowing the attacker to activate the backdoor to obtain private information while normal requests are denied.
Unlike backdoor attacks in classification tasks, the goal is to ensure the victim model responds positively to triggered query prompts while denying untriggered requests. The attacker injects a small, fixed ratio of poisoning data into the training dataset, embedding the trigger and private response into the respective privacy query prompts.
While refining, the unmodified, pre-trained LLM learns the backdoor concept. The final loss function combines maintaining benign functionality and enhancing backdoor effectiveness. The backdoor remains dormant until activated by a specific, inconspicuous trigger known only to the attacker. This design ensures the model's performance remains benign for all non-triggered queries, while triggered queries reveal private data.
Experimental Insights
This study evaluated popular LLMs, selecting generative pre-trained transformer 3.5 (GPT-3.5)-turbo from open artificial intelligence (OpenAI) as the close-source model and mistral-7B instruct as the open-source model. The experimental setup involved injecting poisoning data with specific triggers into these models, utilizing a batch size of 1, a learning rate multiplier of 5 × 10−5, and a fixed temperature of 0 to ensure deterministic outputs. Two trigger injection settings—system prompt and user prompt—were tested to assess their impact on attack performance.
The proposed method showed superior results to the pleak method, achieving higher attack success rates (ASR) and better performance in entity extraction and prompt-stealing tasks. The analysis also explored the impact of trigger length, revealing that longer triggers only sometimes improved performance and could sometimes be counterproductive. Various training methods for mistral-7B, including full fine-tuning, parameter-efficient fine-tuning (PEFT), and LoRA, were compared, with full fine-tuning delivering the best results despite higher resource demands.
The study further examined how Top-p and Top-k values affected ASR, finding that ASR remained high across different settings. The training cost analysis underscored the computational complexity and efficiency of the proposed backdoor training approach.
Conclusion
To summarize, a comprehensive study on a novel backdoor method for stealing private data from LLMs was presented, demonstrating its effectiveness in extracting sensitive information without prior model knowledge. The attack achieved up to a 92.5% success rate in GPT-3.5-turbo, highlighting the severe threat posed by such methods and emphasizing the need for robust security measures. The results validated the approach and called for further research and development of countermeasures to protect AI technologies.