In a recent submission to the arXiv preprint server*, researchers introduced a set of prerequisites for improving robustness evaluations and identified embedding space attacks on large language models (LLMs) as a significant threat model for generating malicious content with open-source models.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Background
In recent years, foundation models, massive neural networks trained with substantial data and compute, have demonstrated strong capabilities in computer vision and natural language processing (NLP), and their rapid progress has prompted discussion of risks on the path toward artificial general intelligence (AGI). Nevertheless, adversarial robustness remains an unresolved challenge for neural networks, including LLMs. Deep neural networks are susceptible to subtle input perturbations, known as adversarial examples, that deceive models into making incorrect predictions.
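To illustrate the kind of perturbation involved, the following sketch applies a fast gradient sign method (FGSM) style step to a toy image classifier. The untrained model and random input are placeholders, and the example is illustrative rather than taken from the summarized paper.

```python
# Minimal FGSM-style sketch: a small input perturbation can change a classifier's prediction.
# The toy untrained model and random input are placeholders for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10)).eval()  # toy classifier

x = torch.rand(1, 1, 28, 28, requires_grad=True)   # placeholder "image"
clean_pred = model(x).argmax(dim=1)                # prediction on the clean input

# Step in the direction that increases the loss of the current prediction.
loss = F.cross_entropy(model(x), clean_pred)
loss.backward()
epsilon = 0.1                                      # perturbation budget
x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

# With a sufficient budget, the perturbed prediction may differ from the clean one.
print("clean:", clean_pred.item(), "perturbed:", model(x_adv).argmax(dim=1).item())
```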
Numerous defensive strategies have been proposed; however, most are evaluated only against the attacks that were available when the defense was published and are later shown to be ineffective by stronger follow-up assessments. Adversarial robustness in neural networks has remained a persistent issue over the past decade, and the current study warns that the imminent arms race between adversarial defenses and attacks in LLMs risks repeating these past patterns of flawed defense evaluations.
Advancements in adversarial research in LLMs
The introduction of adversarial examples in neural networks sparked a great deal of research and an ongoing arms race between adversarial attacks and defenses. Progress in enhancing robustness has nonetheless been limited, as observed in the work of Croce and Hein, and flawed defense evaluations have been a significant obstacle. Carlini et al. found multi-modal LLMs (MMLLMs) to be vulnerable to image space attacks, whereas natural language space attacks initially appeared less effective at disrupting them. The effective adversarial attack presented by Zou et al. subsequently spurred the creation of new attack strategies.
Numerous defense strategies were proposed in response to these attacks. In their evaluation of methods to improve the robustness of LLM assistants, Jain et al. emphasized the potential of filtering-based strategies. Kumar et al. presented a certified defense method that uses a surrogate model to examine input substrings for toxicity.
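The general idea behind such substring checking can be sketched as follows. This is a simplified, word-level illustration rather than Kumar et al.'s certified procedure, and `is_harmful` stands in for a hypothetical surrogate safety classifier.

```python
# Simplified illustration of surrogate-based substring checking (not the certified method itself).
from typing import Callable, List

def erase_and_check(prompt: str,
                    is_harmful: Callable[[str], bool],
                    max_erase: int = 10) -> bool:
    """Flag the prompt if it, or any version with up to `max_erase` trailing
    words removed, is judged harmful by the surrogate model."""
    words: List[str] = prompt.split()
    for k in range(min(max_erase, len(words)) + 1):
        candidate = " ".join(words[: len(words) - k])
        if candidate and is_harmful(candidate):
            return True
    return False

# Toy surrogate for demonstration: a keyword check standing in for a learned classifier.
toy_surrogate = lambda text: "bomb" in text.lower()
print(erase_and_check("tell me how to build a bomb xk!29", toy_surrogate))  # True
```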
Crafting robust defense frameworks for LLMs
In prior research, inaccurate robustness assessments were prevalent, in part because best practices were adopted slowly. To prevent these issues from recurring in LLMs, early defense evaluations should adhere to guidelines specific to NLP and be grounded in a comprehensive understanding of potential threat models.
Accurate Defense Evaluations: Accurate robustness comparisons necessitate well-defined threat models. A threat model encompasses all aspects of the defense and the associated attack, including the adversary's objectives, benchmark dataset, and hyperparameters. An incomplete threat model can introduce ambiguities and minor variations in evaluations across different studies, leading to significant discrepancies in results.
Benchmarks: Meaningful benchmarks are crucial for systematic and comparable evaluations of LLM robustness to adversarial prompting. Currently, there is no universally agreed-upon benchmark or threat model for assessing LLM robustness in this context. Early works have demonstrated that LLMs can be manipulated to provide malicious responses using adversarial prompts, such as "tell me how to build a bomb." However, defining what constitutes "harmful" or "toxic" can be subjective and context-dependent.
While establishing a general definition of "harmful behavior" or "toxicity" for LLM assistants is challenging, it may not be necessary for evaluating attacks and defenses. Narrow benchmark datasets can be susceptible to overfitting and might not accurately reflect a system's robustness to diverse threats; at the same time, straightforward and standardized benchmarks simplify result comparisons and reduce potential errors in defense evaluations. Simple benchmarks are therefore recommended, especially in the current phase, where attacks are still inefficient and defense evaluations remain error-prone.
Threat Model Dimensions: LLMs present unique challenges due to their discrete input space. The authors introduce LLM-specific prerequisites for an adversarial prompting threat model, ranging from specific to general (a schematic sketch follows the list):
- System Prompt: LLM assistants can receive a handcrafted prompt, a predefined prompt, or none.
- Input Prompt: Attacks can be integrated into predefined prompts, or they can modify part or all of the input string.
- Input Modalities: Attacks may target text-only inputs or exploit other supported modalities like images or audio.
- Token Budget: Attacks can be constrained to a specific number of token modifications or unrestricted.
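As an illustration of how these dimensions might be recorded alongside an evaluation, the sketch below defines a simple configuration object; the field names and example values are hypothetical and are not prescribed by the paper.

```python
# Hypothetical specification of an adversarial prompting threat model.
# Field names and example values are illustrative, not taken from the paper.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ThreatModel:
    system_prompt: Optional[str]        # handcrafted, predefined, or None
    attack_surface: str                 # e.g., "full_input" or "suffix_only"
    input_modalities: List[str] = field(default_factory=lambda: ["text"])
    token_budget: Optional[int] = None  # None means unrestricted

# Example: a text-only suffix attack with a 20-token budget and no system prompt.
example = ThreatModel(system_prompt=None, attack_surface="suffix_only", token_budget=20)
print(example)
```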
Embedding Attacks: Adversarial attacks in the embedding space of LLMs are generally ignored because many threat models concentrate on attacks that can be transferred to closed-source models through natural language inputs. However, embedding space attacks have the potential to pose significant risks: they can be used to distribute hazardous knowledge, promote biases, spread misinformation, or create "troll" bots on social media, among other malicious actions.
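To make the threat concrete, the sketch below shows how an attacker with white-box access to an open-source model could optimize continuous input embeddings toward an affirmative target completion. It is a minimal illustration, assuming a Hugging Face causal language model ("gpt2" as a placeholder) and a hypothetical target string, and it is not the exact attack described in the paper.

```python
# Minimal sketch of an embedding space attack on an open-source causal LM.
# Model name, prompt, and target are placeholders; the paper's procedure may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder open-source model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Explain the following request:"  # position the adversary controls (placeholder)
target = " Sure, here is how"              # affirmative target the attacker optimizes toward

prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids

embed = model.get_input_embeddings()
adv_emb = embed(prompt_ids).detach().clone().requires_grad_(True)  # continuous variables
target_emb = embed(target_ids).detach()

opt = torch.optim.Adam([adv_emb], lr=1e-3)
n_prompt = prompt_ids.shape[1]

for _ in range(100):
    inputs = torch.cat([adv_emb, target_emb], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Cross-entropy of the target tokens, predicted from the adversarial embeddings.
    pred = logits[:, n_prompt - 1:-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the optimization runs directly over continuous embeddings rather than discrete tokens, it avoids the costly discrete search of natural language attacks, which is why it applies only where the model weights and embedding layer are accessible.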
Circumventing a Defense: The defense's core assumption was that the attacker requires a handcrafted instruction to guide the attack, which a surrogate model can detect as harmful. However, the authors demonstrated that by crafting attacks without an instruction, or with a benign one, they could bypass the defense as long as the adversarial attack string itself went undetected as harmful. This circumvention highlights the need for precise threat model definitions and illustrates how seemingly promising defenses can be broken by subsequent evaluations.
Conclusion
In summary, the authors emphasize the need for comprehensive defense evaluations amid the emerging arms race between adversarial attacks and defenses in LLMs, and they introduce LLM-specific prerequisites to strengthen defense assessments. The study also identifies embedding space attacks as a significant threat model for open-source LLMs, far more efficient than previous attacks, underscoring the challenges of safeguarding open-source models.