AI agents are more vulnerable than you think: Discover how the new AgentHarm benchmark reveals shocking weaknesses in LLMs, raising urgent questions about AI safety and the growing risks of malicious misuse.
AgentHarm evaluates the performance of LLM agents that have to execute multi-step tool calls to fulfill user requests. We collect 110 unique and 330 augmented agentic behaviors across 11 harm categories using 104 distinct tools. Each behavior has a harmful and benign counterpart. Research: AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
In a recent paper posted on the arXiv preprint* server, researchers from Gray Swan AI and the UK AI Safety Institute explored the vulnerabilities of large language model (LLM) agents to harmful requests and jailbreak attacks. They emphasized the need to assess the safety and security of LLM agents as their application expands across various fields.
The study introduced a novel benchmark called "AgentHarm" to evaluate the robustness of LLM agents against misuse, particularly when prompted to perform harmful tasks. It focused on the vulnerabilities of LLM agents, especially those integrated with external tools, to malicious activities. The benchmark measures the likelihood of LLM agents complying with harmful requests and their ability to retain functionality after jailbreak attempts.
A key feature of AgentHarm is the use of "synthetic tools," which are safe proxies simulating real-world applications. These tools allow researchers to test agents’ performance in complex tasks without risking actual harm. The benchmark provides an environment where LLM agents interact with these tools, offering a controlled but realistic way to assess potential misuse scenarios.
LLM Agents and Jailbreak Attacks
The rapid development of artificial intelligence (AI), particularly LLMs, has transformed how users interact with technology. Initially studied in the context of chatbots, the main concern was their potential to provide harmful information. However, LLM agents, which use external tools for complex tasks, present significant risks if misused.
While LLM robustness against adversarial attacks like jailbreaks has been studied, the focus has primarily been on chatbot functions. In contrast, LLM agents, now used in fields like chemistry and software engineering, can execute harmful actions, such as ordering illegal items, highlighting the need for stronger safety measures. The robustness of these agents remains underexplored, and this study fills that gap by introducing the AgentHarm benchmark to evaluate their potential for harmful misuse.
AgentHarm: Benchmark for Malicious Agent Behavior
In this paper, the authors introduced AgentHarm, a benchmark designed to assess the robustness of LLM agents against harmful requests. It includes 110 unique malicious tasks, expanded into 440 tasks across 11 categories: fraud, cybercrime, and harassment. These tasks require agents to use synthetic tools, which simulate real-world applications like web searches and email services. The tools are carefully designed to ensure safety, allowing the benchmark to test misuse scenarios without involving actual harm. The main goal was to measure whether models reject harmful requests and evaluate the ability of jailbroken agents to perform multi-step tasks coherently.
To evaluate performance, the researchers tested various leading LLMs, including generative pre-trained transformer version 3.5 turbo (GPT-3.5-Turbo) and GPT-4o mini from OpenAI; Claude 3 Sonnet, Claude 3 Haiku, Claude-3.5-Sonnet, and Claude-3-Opus from Anthropic; Gemini-1.0-Pro, Gemini-1.5-Pro, and Gemini-1.5-Flash from Google; Mistral Large 2 and Mistral Small 2 from Mistral AI; and Llama-3.1-70B, Llama-3.1-8B, and Llama-3.1-405B from Meta. They evaluated the models’ responses to both benign and harmful prompts, focusing on compliance rates and jailbreak effectiveness. Notably, the study also examined agent performance on "non-refused" behaviors—tasks where the agents did not refuse to act—by measuring how coherently and successfully they completed harmful actions after jailbreaks. Human-written rubrics were combined with automated scoring to ensure reliable results, addressing potential biases. Additionally, 30% of tasks were withheld to prevent future contamination.
The methodology involved direct attacks, where users made harmful requests, and agents’ responses were evaluated based on their ability to perform tasks correctly and coherently. Tasks simulated real-world scenarios while remaining safe, and the scoring criteria were designed to assess proper function use and coherent outputs. Each task required agents to use 2 to 8 distinct tools in sequence. The benchmark also included benign tasks for baseline comparisons, and synthetic tools were standardized for safe testing.
Effects of Using AgentHarm
The outcomes showed that many leading LLMs surprisingly complied with harmful tasks even without being jailbroken. For instance, GPT-4o mini had a harm score of 62.5%. It refused only 22% of harmful requests, highlighting potential gaps in existing safety training, which may not fully address the complexities of agentic behavior.
The authors also found that simple jailbreak templates, initially designed for chatbots, could be easily adapted for LLM agents. This significantly increased compliance with harmful tasks. For example, GPT-4o's harm score rose from 48.4% to 72.7% after using a jailbreak template, while its refusal rate dropped from 48.9% to 13.6%.
Jailbroken agents also maintained coherence in multi-step tasks, as seen with Claude 3.5 Sonnet, whose harm score increased from 13.5% to 68.7%, and refusal rate fell from 85.2% to 16.7%. The study highlighted the importance of evaluating agent capabilities post-jailbreak, particularly in "non-refused" scenarios. The ability to retain functionality and coherently complete tasks, even after a jailbreak, demonstrates the persistent risk posed by these agents if safety measures are not improved. These results suggest that jailbreaks enabled harmful actions and preserved agents' capabilities with minimal performance loss.
The benchmark provided a deeper understanding of the link between refusal rates and agent capabilities. The study highlighted the importance of evaluating agent performance based not only on refusal rates but also on their ability to complete tasks coherently, which is crucial for developing more robust safety protocols.
Conclusion and Future Scopes
In summary, AgentHarm proved to be an effective tool for assessing the harmfulness and robustness of LLM agents. Its importance goes beyond academic research, as it addresses the practical use of LLM agents in areas like e-commerce, customer service, and content creation. Understanding the vulnerabilities of these agents is crucial for their safe deployment in real-world settings. To encourage further collaboration, the authors have publicly released the AgentHarm dataset on HuggingFace, providing the AI safety community with a resource to develop and test defenses against malicious agent behavior.
Future research should explore more sophisticated attack strategies and extend the benchmark to cover an even broader range of misuse cases. By proactively addressing these challenges, the AI community can work toward developing safer, more secure LLM agents.
*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.
Journal reference:
- Preliminary scientific report.
Preliminary scientific report. Andriushchenko, M., & et al. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. arXiv, 2024, 2410, 09024. DOI: 10.48550/arXiv.2410.09024, https://arxiv.org/abs/2410.09024