Benchmarking OpenAI’s o1 Models Advances Automated Cybersecurity with DARPA Challenge

Leveraging DARPA's rigorous AI Cyber Challenge, Alan Turing Institute researchers test and highlight the potential of OpenAI's o1 models to revolutionize vulnerability detection in real-world software environments.

Research: Benchmarking OpenAI o1 in Cyber Security. Image Credit: JLStock / Shutterstock

*Important notice: arXiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be regarded as definitive, used to guide development decisions, or treated as established information in the field of artificial intelligence research.

In an article recently submitted to the arXiv* preprint server, researchers at the Alan Turing Institute evaluated OpenAI’s newly developed o1-preview and o1-mini models for automated vulnerability detection (AVD) in real-world software.

By benchmarking these models against the earlier GPT-4o (generative pre-trained transformer) using the Defense Advanced Research Projects Agency's (DARPA) artificial intelligence (AI) Cyber Challenge and a modified Nginx server, the study aimed to assess their ability to detect vulnerabilities efficiently.

Results indicate that the o1-preview model notably surpasses GPT-4o in success rate and efficiency, particularly in complex scenarios.

Background on LLMs for AVD

Large language models (LLMs) have recently shown promise in automating complex cybersecurity tasks like vulnerability detection and program repair, which are traditionally labor-intensive and highly specialized.

Previous studies have evaluated LLMs on a variety of autonomy and security-related tasks, with a particular focus on capabilities like prompt injection and interpreter abuse. However, these evaluations did not specifically target real-world vulnerability detection in dynamic, complex software environments.

Other benchmarks, such as Meta’s CyberSecEval 2 and Project Naptime, have evaluated models across a broad spectrum of cybersecurity tasks, but they either lacked applicability to real-world scenarios or did not sufficiently handle newly emerging security vulnerabilities.

This paper addressed these gaps by testing OpenAI’s o1 and o1-mini models within DARPA’s AI Cyber Challenge framework using a modified Nginx server, offering a controlled yet realistic testbed for AVD. By employing an iterative reflexion loop for input refinement, this study provided notable new insights into LLMs' real-world capabilities in cybersecurity, demonstrating significant performance improvements over earlier models, especially in handling complex vulnerabilities.

AVD and Program Repair (APR) with LLMs

The DARPA AI Cyber Challenge is an initiative aimed at advancing cybersecurity practices through AVD and automated program repair (APR) using LLMs. The challenge builds on DARPA’s earlier Cyber Grand Challenge (CGC) by introducing new challenge projects that embed vulnerabilities into codebases.

Participants’ models must identify these vulnerabilities and propose fixes that preserve program functionality. The DARPA AI Cyber Challenge uses a modified Nginx web server, offering a realistic but controlled environment to assess the effectiveness of LLMs on AVD and APR tasks. This setup allows researchers to evaluate models on real-world-like systems that the models have not been trained on, providing a rigorous testbed for assessing model capabilities.
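
To make this setup concrete, the sketch below shows one way a harness could check whether a model-generated input triggers a memory-safety fault in a sanitizer-instrumented build of the target. The binary name, command-line flags, and crafted request are illustrative assumptions for this article, not the challenge's actual tooling.

```python
import subprocess

# Hypothetical harness: the AI Cyber Challenge ships its own build and test
# scripts, so the binary name and invocation below are illustrative only.
TARGET = "./nginx_asan"      # assumed: Nginx built with AddressSanitizer
TIMEOUT_S = 10

def triggers_vulnerability(test_input: bytes) -> bool:
    """Feed a model-generated request to the instrumented target and report
    whether a sanitizer fault (e.g., a heap-buffer overflow) was observed."""
    try:
        proc = subprocess.run(
            [TARGET, "--test-stdin"],   # assumed single-shot test mode
            input=test_input,
            capture_output=True,
            timeout=TIMEOUT_S,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang is not counted as a detected memory error here

    # AddressSanitizer writes its report to stderr and exits non-zero on a fault.
    return proc.returncode != 0 and b"AddressSanitizer" in proc.stderr

if __name__ == "__main__":
    crafted = b"GET / HTTP/1.1\r\nHost: example\r\n\r\n"  # placeholder input
    print("vulnerability triggered:", triggers_vulnerability(crafted))
```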

The benchmark employed a "reflexion loop" method, wherein LLMs analyzed failed attempts and used this feedback to improve their vulnerability-detection inputs. This iterative process, facilitated by the LangGraph and LangChain libraries, enabled standardized evaluations across models, even those without system prompts.
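
As a rough illustration of that iterative process, the plain-Python sketch below captures the shape of a reflexion loop: generate a candidate input, run it against the target, and feed a summary of the failure back to the model. The study orchestrated this with LangGraph and LangChain; the generate, run_target, and summarise_failure callables here are assumed placeholders for those components, not the authors' actual interfaces.

```python
# Minimal reflexion-loop sketch (plain Python). The callables passed in are
# assumed placeholders for the model call, the Nginx test harness, and a
# failure-summarisation step.
def reflexion_loop(generate, run_target, summarise_failure, max_iters=5):
    feedback = ""
    history = []
    for attempt in range(1, max_iters + 1):
        # Ask the model for a candidate vulnerability-triggering input,
        # including feedback from every previous failed attempt.
        candidate = generate(feedback=feedback, history=history)
        result = run_target(candidate)
        if result.crashed:
            return candidate, attempt        # success and loops used
        # Turn the failure (logs, sanitizer output) into text the model
        # can reason over on the next iteration.
        feedback = summarise_failure(result)
        history.append((candidate, feedback))
    return None, max_iters                   # no working input found
```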

By focusing on both detection and patching, the study delivered detailed insights into LLM performance on practical cybersecurity tasks and highlighted areas for improvement in model-based vulnerability management.

Model Performance in AVD Tasks

This evaluation assessed three language models (GPT-4o, o1-mini, and o1-preview) on their ability to generate test inputs that trigger vulnerabilities in the Nginx project. The results showed that o1-preview outperformed GPT-4o in accuracy and cost-effectiveness, succeeding on 11 of 14 tasks compared with 3 for GPT-4o and 2 for o1-mini. Although o1-mini performed roughly on par with GPT-4o, it operated at just one-fifth of the cost.

Success rates were calculated using the reflexion-loop methodology, which allowed models multiple attempts to produce a working solution. Despite its higher per-token cost, o1-preview was ultimately cheaper because it required fewer reflexion loops to reach a successful solution.
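
To see why fewer loops can outweigh a higher per-token price, consider the illustrative arithmetic below. The per-attempt figures are hypothetical placeholders, not costs reported in the paper; only the relationship (total spend scales with the number of loops) reflects the study's observation.

```python
# Illustrative only: the per-attempt costs are hypothetical, not the paper's figures.
def total_cost(cost_per_attempt: float, attempts: int) -> float:
    """Overall spend = cost of one generate-and-test attempt * attempts needed."""
    return cost_per_attempt * attempts

cheap_but_slow = total_cost(cost_per_attempt=0.05, attempts=10)   # e.g., a cheaper model
pricey_but_quick = total_cost(cost_per_attempt=0.20, attempts=2)  # e.g., a pricier model

print(f"cheaper model, many loops: ${cheap_but_slow:.2f}")
print(f"pricier model, few loops:  ${pricey_but_quick:.2f}")
```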

Furthermore, o1-preview’s success in targeting the vulnerable code paths suggested potential applications for real-world settings.

A key difference between the models was speed. GPT-4o generated inputs the fastest (18 seconds), whereas o1-mini and o1-preview took significantly longer (42 and 89 seconds, respectively).

The qualitative analysis highlighted o1-preview’s ability to produce inputs that more precisely targeted vulnerabilities than GPT-4o. For instance, in cases such as CPV3, o1-preview generated inputs that effectively triggered a heap-buffer overflow, while GPT-4o failed to exploit the vulnerability.

Although o1-preview did not achieve full exploitation in all cases, its near-successes suggested valuable insights for developers to enhance security through improved vulnerability analysis and detection techniques.

Conclusion and Future Directions

In conclusion, using DARPA's AI Cyber Challenge and a modified Nginx server, this study evaluated OpenAI’s o1-preview and o1-mini models against GPT-4o for AVD. Results showed that o1-preview outperformed GPT-4o in both success rate and efficiency, achieving 11 out of 14 tasks while requiring fewer reflexion loops, which significantly reduced the overall evaluation cost.

Qualitatively, o1-preview also demonstrated better input targeting for vulnerabilities like heap-buffer overflows. The study proposes open-sourcing the benchmark framework in the future to expand model testing and incorporating additional AVD and APR tasks to explore the evolving potential of language models in cybersecurity.


Journal reference:
  • Preliminary scientific report. Ristea, D., Mavroudis, V., & Hicks, C. (2024). Benchmarking OpenAI o1 in Cyber Security. arXiv. DOI: 10.48550/arXiv.2410.21939, https://arxiv.org/abs/2410.21939

Written by

Soham Nandi

Soham Nandi is a technical writer based in Memari, India. His academic background is in Computer Science Engineering, specializing in Artificial Intelligence and Machine Learning. He has extensive experience in Data Analytics, Machine Learning, and Python, and has worked on group projects involving Computer Vision, Image Classification, and App Development.

