Novel Backdoor Attacks on Large Language Models

In a paper published in the journal Electronics, researchers explored a novel approach to data-stealing attacks by introducing an adaptive method to extract private training data from pre-trained large language models (LLMs) via backdooring. Their method focused on model customization and was conducted in two phases: backdoor training and activation.

Comparison of the performance demonstrated by our method and PLeak in Task 2. Image Credit: https://www.mdpi.com/2079-9292/13/14/2858
Comparison of the performance demonstrated by our method and PLeak in Task 2. Image Credit: https://www.mdpi.com/2079-9292/13/14/2858

During the customization stage, attackers injected the backdoor into the pre-trained LLM by poisoning a small portion of the training dataset. In the inference stage, attackers extracted private information from the third-party knowledge database using a pre-defined backdoor trigger. The researchers demonstrated the effectiveness of their attack through extensive experiments, achieving a notable success rate and maintaining stealthiness during normal inference.

Background

Past work has shown that LLMs excel in natural language processing tasks but are vulnerable to various security threats, including stealing and backdoor attacks. Researchers have explored stealing attacks that extract sensitive data from models and backdoor attacks that insert triggers into training data, allowing attackers to retrieve private information later. Methods like model construction side-channel (MosConS) and defenses like Prada have been proposed to address these threats.

Stealthy Data Extraction

This chapter outlines the approach for stealing attacks, followed by a detailed explanation of backdoor injection into victim models. Finally, the team demonstrates how attackers can steal private information during training (customization) and inference stages.

The attacker knows the pre-defined trigger that activates the backdoor and can only interact with the victim models via application programming interface (API) access, reflecting a practical, real-world scenario. The trigger design must be unique to ensure the effective extraction of private information.

The attack unfolds in various scenarios, emphasizing different phases of adversary operations. Initially, during model training, The attacker implants a backdoor into the large language model (LLM) during its customization process by third-party platforms. Once integrated, this backdoor allows the attacker to exploit the model to access private information or perform malicious activities.

Mitigating backdoor attacks is challenging due to the backdoor's subtlety, requiring specialized detection tools and thorough monitoring. Implementing preventive measures throughout the model's lifecycle demands deep knowledge of the LLM architecture and ongoing monitoring for unusual behavior. As adversaries evolve, security controls must adapt to counter new methods.

Inspired by traditional backdoor attacks in image classification and sentiment analysis, this approach assumes the attacker controls the foundation model. A stealthy backdoor injection method is defined to steal private information during model customization.

The objective is twofold: maintain benign functionality and ensure backdoor effectiveness. During model customization, a small ratio of poisoning data is added to the training sets. After backdoor training, the model is made publicly available, allowing the attacker to activate the backdoor to obtain private information while normal requests are denied.

Unlike backdoor attacks in classification tasks, the goal is to ensure the victim model responds positively to triggered query prompts while denying untriggered requests. The attacker injects a small, fixed ratio of poisoning data into the training dataset, embedding the trigger and private response into the respective privacy query prompts.

While refining, the unmodified, pre-trained LLM learns the backdoor concept. The final loss function combines maintaining benign functionality and enhancing backdoor effectiveness. The backdoor remains dormant until activated by a specific, inconspicuous trigger known only to the attacker. This design ensures the model's performance remains benign for all non-triggered queries, while triggered queries reveal private data.

Experimental Insights

This study evaluated popular LLMs, selecting generative pre-trained transformer 3.5 (GPT-3.5)-turbo from open artificial intelligence (OpenAI) as the close-source model and mistral-7B instruct as the open-source model. The experimental setup involved injecting poisoning data with specific triggers into these models, utilizing a batch size of 1, a learning rate multiplier of 5 × 10−5, and a fixed temperature of 0 to ensure deterministic outputs. Two trigger injection settings—system prompt and user prompt—were tested to assess their impact on attack performance.

The proposed method showed superior results to the pleak method, achieving higher attack success rates (ASR) and better performance in entity extraction and prompt-stealing tasks. The analysis also explored the impact of trigger length, revealing that longer triggers only sometimes improved performance and could sometimes be counterproductive. Various training methods for mistral-7B, including full fine-tuning, parameter-efficient fine-tuning (PEFT), and LoRA, were compared, with full fine-tuning delivering the best results despite higher resource demands.

The study further examined how Top-p and Top-k values affected ASR, finding that ASR remained high across different settings. The training cost analysis underscored the computational complexity and efficiency of the proposed backdoor training approach.

Conclusion

To summarize, a comprehensive study on a novel backdoor method for stealing private data from LLMs was presented, demonstrating its effectiveness in extracting sensitive information without prior model knowledge. The attack achieved up to a 92.5% success rate in GPT-3.5-turbo, highlighting the severe threat posed by such methods and emphasizing the need for robust security measures. The results validated the approach and called for further research and development of countermeasures to protect AI technologies.

Journal reference:
Silpaja Chandrasekar

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

Citations

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Chandrasekar, Silpaja. (2024, July 30). Novel Backdoor Attacks on Large Language Models. AZoAi. Retrieved on January 15, 2025 from https://www.azoai.com/news/20240730/Novel-Backdoor-Attacks-on-Large-Language-Models.aspx.

  • MLA

    Chandrasekar, Silpaja. "Novel Backdoor Attacks on Large Language Models". AZoAi. 15 January 2025. <https://www.azoai.com/news/20240730/Novel-Backdoor-Attacks-on-Large-Language-Models.aspx>.

  • Chicago

    Chandrasekar, Silpaja. "Novel Backdoor Attacks on Large Language Models". AZoAi. https://www.azoai.com/news/20240730/Novel-Backdoor-Attacks-on-Large-Language-Models.aspx. (accessed January 15, 2025).

  • Harvard

    Chandrasekar, Silpaja. 2024. Novel Backdoor Attacks on Large Language Models. AZoAi, viewed 15 January 2025, https://www.azoai.com/news/20240730/Novel-Backdoor-Attacks-on-Large-Language-Models.aspx.

Comments

The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoAi.
Post a new comment
Post

While we only use edited and approved content for Azthena answers, it may on occasions provide incorrect responses. Please confirm any data provided with the related suppliers or authors. We do not provide medical advice, if you search for medical information you must always consult a medical professional before acting on any information provided.

Your questions, but not your email details will be shared with OpenAI and retained for 30 days in accordance with their privacy principles.

Please do not ask questions that use sensitive or confidential information.

Read the full Terms & Conditions.

You might also like...
Single Neuron, Massive Impact: A Breakthrough in Sustainable AI