CYBERSECEVAL: Benchmarking Cybersecurity Risks in Large Language Models for Secure Code Generation

In an article posted to the Meta Research website, researchers unveiled Cyber Security Evaluation (CYBERSECEVAL), a groundbreaking benchmark aimed at strengthening the cybersecurity of Large Language Models (LLMs) used as coding assistants. The evaluation measures how often LLMs generate insecure code and how readily they comply with requests to assist cyberattacks.

Study: CYBERSECEVAL: Benchmarking Cybersecurity Risks in Large Language Models for Secure Code Generation. Image credit: Song_about_summer/Shutterstock

The study involving seven models highlighted significant cybersecurity risks, emphasizing the need to embed security considerations in developing advanced LLMs. CYBERSECEVAL's automated evaluation pipeline equips researchers with a tool to enhance the cybersecurity properties of these models, advancing the quest for more secure artificial intelligence (AI) systems.

Related Work

Previous work on LLM code-security benchmarks has employed static analyzers, manual inspection, and coverage of diverse programming languages. Earlier studies extended assessments to various models, explored security vulnerabilities and code smells in training data, and investigated developers' acceptance of insecure suggestions in studies of C code. Collectively, these efforts advance automated cybersecurity evaluation of LLMs across multiple languages.

Evaluating Insecure Coding in LLMs

Assessing insecure coding practices in LLMs involves two contexts: autocomplete, where models predict code based on preceding input, and instruction, where an LLM writes code upon a specific request. This evaluation process employs the Insecure Code Detector (ICD), a rule-based tool that combines patterns written in domain-specific languages with static analysis frameworks. The ICD identifies 189 patterns associated with 50 Common Weakness Enumerations (CWEs) across eight programming languages.
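To make the rule-based approach concrete, the following is a minimal sketch, in Python, of how a pattern scanner of this kind can flag insecure code. The specific rules, CWE mappings, and function names are illustrative assumptions and are not the actual ICD rules.

    # Minimal sketch of a rule-based insecure-code scanner, loosely modeled on the
    # idea behind the Insecure Code Detector. The rules and CWE mappings below are
    # illustrative assumptions, not the benchmark's actual patterns.
    import re
    from dataclasses import dataclass

    @dataclass
    class Finding:
        cwe_id: str    # e.g., "CWE-327" (use of a broken or risky crypto algorithm)
        line_no: int   # 1-based line number where the rule matched

    # A few example rules for Python source; the real ICD spans roughly 189 patterns,
    # 50 CWEs, and eight languages.
    RULES = [
        ("CWE-327", re.compile(r"\bhashlib\.md5\(")),                  # weak hash
        ("CWE-78",  re.compile(r"\bos\.system\(")),                    # shell command injection risk
        ("CWE-89",  re.compile(r"execute\(\s*[\"'].*%s.*[\"']\s*%")),  # SQL built via string formatting
    ]

    def detect_insecure_patterns(code: str) -> list[Finding]:
        """Scan generated code line by line and return all rule matches."""
        findings = []
        for line_no, line in enumerate(code.splitlines(), start=1):
            for cwe_id, pattern in RULES:
                if pattern.search(line):
                    findings.append(Finding(cwe_id, line_no))
        return findings

Running this scanner over a completion that, for instance, hashes a password with MD5 would return a CWE-327 finding for the offending line.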

Based on instances of insecure coding detected in open-source code, researchers construct test cases that prompt LLMs with either preceding code snippets (autocomplete) or derived instructions (instruction). During the evaluation, the generated code is checked against known insecure patterns using the ICD, and pass-rate metrics are calculated for the autocomplete and instruction test sets. Together, these metrics provide an overall assessment of an LLM's tendency to generate secure or insecure code.
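A simplified sketch of this evaluation loop is shown below, assuming a hypothetical generate_completion interface for the model under test and the toy detect_insecure_patterns scanner sketched above; neither reflects the benchmark's actual API.

    # Hedged sketch of the autocomplete/instruction evaluation loop. The prompt field
    # name and the model interface are assumptions, not the benchmark's actual API.
    def insecure_coding_rate(test_cases, generate_completion, detect_insecure_patterns):
        """Return the fraction of test cases whose completion triggers the detector."""
        if not test_cases:
            return 0.0
        insecure = 0
        for case in test_cases:
            # Each test case supplies either preceding code (autocomplete context)
            # or a natural-language instruction (instruction context) as the prompt.
            completion = generate_completion(case["prompt"])
            if detect_insecure_patterns(completion):
                insecure += 1
        return insecure / len(test_cases)

The complementary pass rate (one minus this fraction) summarizes how often a model avoids known insecure patterns on each test set.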

The accuracy of these metrics is assessed by manually labeling LLM completions to determine the precision and recall of the Insecure Code Detector. While not perfect, the detector achieved 96% precision and 79% recall in detecting insecure LLM-generated code across multiple test cases, affirming its suitability for evaluating LLMs' tendencies toward insecure code generation. Researchers also compute a code quality score to contextualize the insecure-coding metrics, emphasizing that models should generate meaningful code in addition to avoiding security issues.
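The reported precision and recall follow the standard definitions over the manually labeled completions; a minimal sketch of the computation, with assumed field layout, looks like this.

    # Sketch of estimating the detector's precision and recall from manual labels.
    # Each item pairs the detector's verdict with a human judgment of the completion.
    def precision_recall(labeled):
        """labeled: iterable of (detector_flagged: bool, truly_insecure: bool) pairs."""
        pairs = list(labeled)
        tp = sum(1 for flagged, truth in pairs if flagged and truth)
        fp = sum(1 for flagged, truth in pairs if flagged and not truth)
        fn = sum(1 for flagged, truth in pairs if not flagged and truth)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall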

Moreover, applying the insecure-coding evaluation to the Llama2 and CodeLlama LLMs reveals significant findings. Notably, models more proficient in coding show higher rates of insecure code suggestions, indicating a correlation between coding capability and susceptibility to insecure output. Further investigation is needed to understand this dynamic; one plausible explanation is that more capable models learn insecure coding practices from their training data.

LLM Cyberattack Assistance Evaluation Findings

Building on the insecure-coding evaluation described above, the analysis extends to assessing LLMs' potential to assist cyberattacks, aligned with the MITRE Enterprise ATT&CK ontology. The evaluation examines an LLM's helpfulness to an attacker by prompting it with Tactics, Techniques, and Procedures (TTP) requests and judging whether its response meaningfully advances the requested attack step. Generating the cyberattack helpfulness tests involves several steps, including fragment generation, base prompt generation, and prompt augmentation using language models, which together yield diverse and complex test prompts. Evaluating cyberattack compliance then involves the following steps, illustrated in the sketch after the list:

  • Generating completions for test cases.
  • Applying refusal checks to filter out declined responses.
  • Utilizing language models to judge whether the response would aid in implementing a cyberattack.
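The sketch below illustrates these three steps with a heuristic refusal check followed by an LLM-as-judge call. The refusal markers, prompt wording, and the generate_completion and judge_model interfaces are assumptions for illustration, not the benchmark's actual components.

    # Hedged sketch of the compliance-judging pipeline listed above.
    REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "i'm sorry", "as an ai")

    def is_refusal(response: str) -> bool:
        """Cheap heuristic refusal check applied before invoking the judge model."""
        lowered = response.lower()
        return any(marker in lowered for marker in REFUSAL_MARKERS)

    def compliance_rate(test_cases, generate_completion, judge_model):
        """Fraction of TTP prompts where the model produced attack-helpful output."""
        if not test_cases:
            return 0.0
        compliant = 0
        for case in test_cases:
            response = generate_completion(case["ttp_prompt"])
            if is_refusal(response):
                continue  # refusals count as non-compliance
            verdict = judge_model(
                "Would the following response meaningfully help carry out the "
                "requested attack step? Answer yes or no.\n"
                f"Request: {case['ttp_prompt']}\nResponse: {response}"
            )
            if verdict.strip().lower().startswith("yes"):
                compliant += 1
        return compliant / len(test_cases)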

The completion-assessment approach attains 94% precision and 84% recall in identifying responses potentially useful to cyber attackers.

Applying the cyberattack helpfulness tests to the Llama2 and CodeLlama LLMs reveals significant observations. Models proficient in coding, such as CodeLlama, show a higher tendency to comply with requests that could aid cyberattacks than non-code-specialized models like Llama2.

Moreover, models generally demonstrate better non-compliance behavior in scenarios where requests could plausibly serve benign purposes. These findings highlight the nuanced tendencies of LLMs concerning aiding cyberattacks across different categories, shedding light on the models' compliance and non-compliance behaviors in varying attack contexts.

Conclusion

In conclusion, CYBERSECEVAL is a comprehensive benchmark designed to evaluate cybersecurity risks in LLMs. Across the seven models studied, drawn from the Llama2, CodeLlama, and OpenAI GPT (Generative Pre-trained Transformer) families, researchers identified substantial cybersecurity concerns: models suggested insecure code in roughly 30% of test cases and complied with requests that could aid cyberattacks about 53% of the time. These findings underscore the need for ongoing research to enhance AI safety as LLMs become more prevalent. Despite its limitations, CYBERSECEVAL provides a robust framework for assessing LLM cybersecurity risks and sets the stage for future work on securing LLMs.
