In an article recently posted to the Meta Research website, researchers introduced cybersecurity evaluation (CYBERSECEVAL 3), a new set of security benchmarks for evaluating large language models (LLMs). This update assessed eight risks in two main categories: risks to third parties, developers, and end users of applications.
The paper highlighted new focus areas on offensive security capabilities, including automated social engineering and scaling autonomous cyber operations. The benchmarks were applied to LLM meta-artificial intelligence 3 (Llama 3) and other state-of-the-art LLMs to contextualize risks with and without mitigations.
Background
Previous work has established methods for assessing LLMs' security capabilities, focusing on risks to third parties and application developers. Studies have explored LLMs' potential for aiding spear-phishing attacks, enhancing manual cyber operations, and performing autonomous cyber operations. Notable contributions include evaluating prompt injection vulnerabilities and assessing malicious code execution risks.
LLM Risk Assessment
The analysts assessed four risks to third parties from LLMs: automated social engineering, scaling manual offensive cyber operations, autonomous offensive cyber operations, and autonomous software vulnerability discovery and exploitation. The Llama 3 405b evaluation for spear-phishing showed it could automate convincing phishing content but was less effective than models like generative pre-trained transformer 4 (GPT-4) Turbo and Qwen 2-72b-instruct.
Llama 3 achieved moderate scores in phishing simulations, indicating it could scale phishing efforts but is unlikely to pose a higher risk than other models. Additionally, the role of Llama 3 405b in scaling manual cyber operations was examined, revealing no significant improvement in attacker performance compared to traditional methods.
In a capture-the-flag simulation with 62 volunteers, Llama 3 405b did not significantly enhance the capabilities of novice or expert attackers. Despite some reported benefits, such as reduced mental effort, overall performance improvements were negligible. Llama Guard 3 has been released, which can identify and block misuse of Llama 3 models in cyberattacks, helping to mitigate potential threats while maintaining model safety.
Autonomous Cyber Capabilities
The assessment of Llama 3 70b and 405b models for autonomous offensive cyber operations revealed limited effectiveness. In simulations of ransomware attacks, these models performed poorly in exploit execution and maintaining access despite managing reconnaissance and vulnerability identification. Results showed that Llama 3 70b completed over half of low-sophistication challenges but struggled with more complex tasks.
The potential for autonomous software vulnerability discovery and exploitation by LLMs, including Llama 3, remains constrained due to limited program reasoning capabilities and complex program structures. Testing of Llama 3 405b demonstrated some success in specific vulnerability challenges, outperforming GPT-4 Turbo in certain tasks, but it did not show breakthrough capabilities. To mitigate misuse, deploying Llama Guard 3 is recommended for detecting and blocking cyberattack aid requests.
Llama 3 Cybersecurity Risks
The assessment of Llama 3 models in the context of cybersecurity risks revealed several key concerns for application developers and end-users. These risks include prompt injection attacks, where malicious inputs alter the model's behavior; the potential for models to execute harmful code in attached interpreters; the generation of insecure code; and the risk of models facilitating cyberattacks.
Testing demonstrated that Llama 3, particularly in its 70b and 405b versions, performs comparable to GPT-4 in prompt injection attacks but can still be vulnerable to certain exploitation techniques. The models also tend to generate insecure code, though introducing guardrails such as prompt guards and code shields reduces these risks.
Researchers highly recommend deploying Llama Guard 3 to mitigate these vulnerabilities. This guardrail system detects and blocks malicious inputs, prevents insecure code generation, and limits the models' ability to facilitate cyberattacks. Despite their effectiveness, developers must use them alongside secure coding practices and robust sandboxing techniques to ensure comprehensive protection against potential misuse.
Cybersecurity Guardrails Overview
Several guardrails are recommended to mitigate cybersecurity risks associated with Llama 3. Prompt guard helps reduce the risk of prompt injection attacks by classifying inputs as jailbreak, injection, or benign. It achieves a 97.5% recall rate for detecting jailbreak prompts and a 71.4% detection rate for indirect injections with minimal false positives. Code shield is an inference-time filtering tool that prevents insecure code from entering production systems.
It uses the insecure code detector (ICD) to analyze code patterns across various languages, achieving a 96% precision and 79% recall, with most scans completed in under 70ms. Llama Guard, a fine-tuned version of Llama 3, focuses on preventing compliance with prompts that could facilitate malicious activities. It significantly reduces safety violations but may increase false refusal rates, particularly when used as input and output filters. Together, these tools enhance the security of Llama 3 applications by addressing prompt injections, insecure code, and compliance with potentially harmful prompts.
Conclusion
To summarize, CYBERSECEVAL 3, a new benchmark suite for assessing cybersecurity risks from LLMs, was released, extending CYBERSECEVAL 1 and CYBERSECEVAL 2. The effectiveness of CYBERSECEVAL was demonstrated by evaluating Llama 3 and a select set of contemporary state-of-the-art models against a broad range of cybersecurity risks. The released mitigations could improve multiple risks for Llama 3 and other models.