Benchmarking OpenAI’s o1 Models Advances Automated Cybersecurity with DARPA Challenge

Leveraging DARPA's rigorous AI Cyber Challenge, Alan Turing Institute researchers test OpenAI's o1 models and highlight their potential to advance vulnerability detection in real-world software environments.

Research: Benchmarking OpenAI o1 in Cyber Security. Image Credit: JLStock / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv* preprint server, researchers at the Alan Turing Institute evaluated OpenAI’s newly developed o1-preview and o1-mini models for automated vulnerability detection (AVD) in real-world software.

By benchmarking these models against the earlier generative pre-trained transformer model GPT-4o, using the Defense Advanced Research Projects Agency's (DARPA) artificial intelligence (AI) Cyber Challenge and a modified Nginx server, the study aimed to assess how efficiently the models could detect vulnerabilities.

Results indicate that the o1-preview model notably surpasses GPT-4o in success rate and efficiency, particularly in complex scenarios.

Background on LLMs for AVD

Large language models (LLMs) have recently shown promise in automating complex cybersecurity tasks like vulnerability detection and program repair, which are traditionally labor-intensive and highly specialized.

Previous studies have evaluated LLMs on a variety of autonomy and security-related tasks, with a particular focus on capabilities like prompt injection and interpreter abuse. However, these evaluations did not specifically target real-world vulnerability detection in dynamic, complex software environments.

Other benchmarks, such as Meta's CyberSecEval 2 and Project Naptime, have evaluated models across a broad spectrum of cybersecurity tasks, but they either lacked applicability to real-world scenarios or did not sufficiently cover newly emerging security vulnerabilities.

This paper addressed these gaps by testing OpenAI’s o1 and o1-mini models within DARPA’s AI Cyber Challenge framework using a modified Nginx server, offering a controlled yet realistic testbed for AVD. By employing an iterative reflexion loop for input refinement, this study provided notable new insights into LLMs' real-world capabilities in cybersecurity, demonstrating significant performance improvements over earlier models, especially in handling complex vulnerabilities.

AVD and Program Repair (APR) with LLMs

The DARPA AI Cyber Challenge is an initiative aimed at advancing cybersecurity practices through AVD and automated program repair (APR) using LLMs. The challenge builds on DARPA’s earlier Cyber Grand Challenge (CGC) by introducing new challenge projects that embed vulnerabilities into codebases.

Participants’ models must identify these vulnerabilities and propose fixes that preserve program functionality. The DARPA AI Cyber Challenge uses a modified Nginx web server, offering a realistic but controlled environment to assess the effectiveness of LLMs on AVD and APR tasks. This setup allows researchers to evaluate models on real-world-like systems that the models have not been trained on, providing a rigorous testbed for assessing model capabilities.

The benchmark employed a "reflexion loop" method, wherein LLMs analyzed failed attempts and used this feedback to improve their vulnerability-detection inputs. This iterative process, facilitated by the LangGraph and LangChain libraries, enabled standardized evaluations across models, even those without system prompts.
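
The paper's harness itself is not reproduced here, but the idea behind a reflexion loop can be sketched in a few lines. In the minimal Python sketch below, `ask_model` and `run_target` are hypothetical placeholders standing in for the authors' LangChain/LangGraph model calls and the challenge's test harness; the loop structure, not the exact API, is the point.

```python
# Minimal sketch of a reflexion loop for generating vulnerability-triggering inputs.
# ask_model() and run_target() are hypothetical placeholders, not the paper's code.

MAX_ATTEMPTS = 5  # assumed cap on reflexion iterations


def ask_model(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., via an API client); returns a candidate input."""
    raise NotImplementedError


def run_target(candidate: bytes) -> tuple[bool, str]:
    """Placeholder: feed the candidate to the instrumented target and return
    (vulnerability_triggered, harness_output)."""
    raise NotImplementedError


def reflexion_loop(task_description: str) -> bytes | None:
    prompt = f"Generate an input that triggers the vulnerability:\n{task_description}"
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidate = ask_model(prompt).encode()
        triggered, output = run_target(candidate)
        if triggered:
            return candidate  # success: the input reached the vulnerable code path
        # Reflexion step: feed the failure back so the model can refine its next attempt.
        prompt = (
            f"{task_description}\n"
            f"Attempt {attempt} did not trigger the vulnerability.\n"
            f"Harness output:\n{output}\n"
            "Analyse why it failed and propose a revised input."
        )
    return None  # all attempts exhausted
```

Standardizing on a loop of this shape lets different models be compared on the number of iterations, and therefore the total tokens, they need to reach a working input.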

By focusing on both detection and patching, the study delivered detailed insights into LLM performance on practical cybersecurity tasks and highlighted areas for improvement in model-based vulnerability management.

Model Performance in AVD Tasks

The evaluation assessed three language models (GPT-4o, o1-mini, and o1-preview) on their ability to generate test inputs that trigger vulnerabilities in the Nginx project. o1-preview outperformed GPT-4o in both accuracy and cost-effectiveness, succeeding on 11 of the 14 tasks, compared with 3 for GPT-4o and 2 for o1-mini. Although o1-mini's success rate was comparable to GPT-4o's, it operated at roughly one-fifth of the cost.

Success rates were calculated using the reflexion loop methodology, which allowed models multiple attempts to produce a working solution. Despite its higher per-token cost, o1-preview was ultimately cheaper overall because it required fewer reflexion loops to reach a successful solution.
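
A back-of-the-envelope calculation shows why fewer iterations can outweigh a higher per-token price. The prices, token counts, and loop counts below are purely illustrative assumptions, not figures from the paper.

```python
# Illustrative only: prices, token counts, and loop counts are assumed values,
# not figures from the paper. Total cost scales with the number of reflexion
# loops as well as the per-token price.

def total_cost(price_per_1k_tokens: float, tokens_per_loop: int, loops: int) -> float:
    return price_per_1k_tokens * (tokens_per_loop / 1000) * loops

cheap_but_many = total_cost(price_per_1k_tokens=0.01, tokens_per_loop=4000, loops=10)
pricey_but_few = total_cost(price_per_1k_tokens=0.03, tokens_per_loop=4000, loops=2)

print(f"cheaper per token, many loops: ${cheap_but_many:.2f}")  # $0.40
print(f"pricier per token, few loops : ${pricey_but_few:.2f}")  # $0.24
```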

Furthermore, o1-preview’s success in targeting the vulnerable code paths suggested potential applications for real-world settings.

A key difference between the models was speed: GPT-4o generated inputs the fastest (18 seconds), whereas o1-mini and o1-preview took significantly longer (42 and 89 seconds, respectively).

The qualitative analysis highlighted o1-preview’s ability to produce inputs that more precisely targeted vulnerabilities than GPT-4o. For instance, in cases such as CPV3, o1-preview generated inputs that effectively triggered a heap-buffer overflow, while GPT-4o failed to exploit the vulnerability.
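
Operationally, "triggering a heap-buffer overflow" is typically confirmed by a memory-error sanitizer's crash report. The sketch below is a generic, assumed check: it runs a hypothetical AddressSanitizer-instrumented harness binary on a candidate input and scans the output for the sanitizer's report. The binary name and invocation are illustrative, not the challenge's actual tooling.

```python
# Hypothetical check: run an AddressSanitizer-instrumented harness on a candidate
# input and report whether ASan flagged a heap-buffer-overflow. The binary path
# and command line are illustrative assumptions, not the DARPA challenge harness.
import subprocess


def triggers_heap_overflow(binary: str, candidate: bytes, timeout: int = 10) -> bool:
    proc = subprocess.run(
        [binary],
        input=candidate,       # candidate input delivered on stdin
        capture_output=True,   # collect stdout/stderr for inspection
        timeout=timeout,
    )
    # ASan writes its report to stderr, and the process exits non-zero on a detected error.
    return proc.returncode != 0 and b"heap-buffer-overflow" in proc.stderr


# Usage (hypothetical paths and input):
# if triggers_heap_overflow("./nginx_asan_harness", crafted_request):
#     print("Candidate input triggered the heap-buffer overflow.")
```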

Although o1-preview did not achieve full exploitation in all cases, its near-successes suggested valuable insights for developers to enhance security through improved vulnerability analysis and detection techniques.

Conclusion and Future Directions

In conclusion, using DARPA's AI Cyber Challenge and a modified Nginx server, this study evaluated OpenAI's o1-preview and o1-mini models against GPT-4o for AVD. Results showed that o1-preview outperformed GPT-4o in both success rate and efficiency, solving 11 of the 14 tasks while requiring fewer reflexion loops, which significantly reduced the overall evaluation cost.

Qualitatively, o1-preview also demonstrated better input targeting for vulnerabilities like heap-buffer overflows. The authors propose open-sourcing the benchmark framework to broaden model testing and adding further AVD and APR tasks to explore the evolving potential of language models in cybersecurity.


Journal reference:
  • Preliminary scientific report. Ristea, D., Mavroudis, V., & Hicks, C. (2024). Benchmarking OpenAI o1 in Cyber Security. arXiv. DOI: 10.48550/arXiv.2410.21939, https://arxiv.org/abs/2410.21939

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python. He has worked on group projects that required the implementation of Computer Vision, Image Classification, and App Development.

