As large language models like GPT-4 show potential in boosting software security, researchers warn that human oversight remains crucial as AI struggles with complex, less common vulnerabilities.
In a paper published in the journal Information, researchers in Germany and Portugal examined the potential of chat generative pre-trained transformers (ChatGPTs) to aid in secure software development. Drawing on industry experience and previous work, they conducted two experiments with large language models (LLMs), comparing the GPT-3 and GPT-4 models. The study also emphasized the importance of following international standards such as IEC 62443 and ISO/IEC 27001 for secure software development, which are critical when applied alongside frameworks like MITRE's CWE and the OWASP Top 10 to evaluate AI's role in enhancing security.
The study explored the specific advantages, challenges, and limitations of using ChatGPTs, building on the success of the CyberSecurity Challenges game, which raised awareness of secure coding practices.
Related Work
Past work explored using artificial intelligence (AI) in secure software development, examining standards like International Electrotechnical Commission 62443 (IEC 62443) and the International Organization for Standardization's ISO/IEC 27001 to assess AI's role in enhancing software security. These standards are critical in industrial automation and control systems (IACSs), providing guidelines for secure lifecycle management, although they are not yet fully adapted to account for the capabilities and limitations of generative AI.
Researchers investigated frameworks such as MITRE's common weakness enumeration (CWE) and the Open Worldwide Application Security Project (OWASP) Top 10, noting the potential of models like ChatGPT and Meta's LLaMA in assisting developers. Studies highlighted both the promise and the challenges of AI tools, including vulnerability prediction with LineVul. Reports have also found that AI code assistants such as GitHub Copilot can produce and propagate insecure coding practices, raising concerns about the spread of AI-generated vulnerabilities.
LLMs in Vulnerability Assessment
The study explored how advanced LLMs like GPT-3 can identify and mitigate software vulnerabilities. Using the ChatGPT interface, the researchers tested five C/C++ challenges from the Sifu platform, including buffer overflow and integer overflow vulnerabilities. These interactions aimed to assess the model's capability to detect vulnerabilities and suggest effective solutions.
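As an illustration of the kind of C code such challenges revolve around, the sketch below shows a classic stack-based buffer overflow (CWE-121). It is a hypothetical example, not a challenge taken from the Sifu platform.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical CWE-121 example (stack-based buffer overflow), not an actual
 * Sifu challenge: strcpy() copies attacker-controlled input into a fixed
 * 16-byte stack buffer without any bounds check. */
void store_username(const char *input)
{
    char name[16];
    strcpy(name, input);            /* overflows when input exceeds 15 characters */
    printf("Hello, %s\n", name);
}

int main(int argc, char **argv)
{
    if (argc > 1)
        store_username(argv[1]);    /* e.g., a long argv[1] smashes the stack */
    return 0;
}
```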
In one of the challenges, a code snippet demonstrated a side-channel leakage vulnerability in which the function's runtime depended on its inputs, potentially revealing sensitive information. Interactions with GPT-3 involved a series of questions asking the model to identify the vulnerability, provide the associated common weakness enumeration (CWE) identifier, and propose a fix. GPT-3 successfully identified common vulnerabilities like buffer overflows (CWE-121) but failed to detect more complex vulnerabilities like integer overflow (CWE-190), which are especially crucial in industrial settings.
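The side-channel challenge can be pictured with a sketch like the one below: a secret comparison that exits on the first mismatching byte, so its runtime depends on the input. The code is illustrative and not the snippet used in the study.

```c
#include <stddef.h>

/* Illustrative timing side channel (not the study's snippet): the early
 * return makes the runtime depend on how many leading bytes of the guess
 * match the secret, leaking information to an attacker who measures time. */
int check_token(const unsigned char *secret, const unsigned char *guess, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (secret[i] != guess[i])
            return 0;               /* early exit reveals position of first mismatch */
    }
    return 1;
}

/* A constant-time variant accumulates differences instead of exiting early. */
int check_token_ct(const unsigned char *secret, const unsigned char *guess, size_t len)
{
    unsigned char diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (unsigned char)(secret[i] ^ guess[i]);
    return diff == 0;
}
```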
The preliminary study involved 43 interactions and covered five specific vulnerabilities. The results provided insights into GPT-3's strengths and weaknesses in recognizing rarer or more context-specific vulnerabilities. This reflects a theoretical limitation: some problems in secure coding are non-decidable, meaning certain security vulnerabilities cannot be fully resolved through algorithmic methods, including AI.
Building on the preliminary study, an extended analysis was performed using the more advanced GPT-4 model. This phase assessed not only improvements in detection but also the model's ability to suggest practical fixes. The study included two sets of code snippets: Set 1, consisting of vulnerabilities from the sysadmin, audit, network, and security (SANS) Top 25, and Set 2, a curated collection developed to test the model's generalization abilities across various security issues.
The SANS Top 25 snippets came from the MITRE website, while the Curated Code Snippet was manually analyzed for multiple vulnerabilities. This curated code was designed to test GPT-4's ability to handle diverse, real-world scenarios. Using 26 snippets in total, the study employed prompts to evaluate GPT-4's performance in code understanding, vulnerability detection, and suggested fixes, focusing on correctness, completeness, and relevance.
The evaluation assessed whether the LLM correctly identified and described each vulnerability, whether it flagged all relevant security issues, and whether its suggested fixes would be applicable in an industrial environment. The results were evaluated through manual analysis and discussions with industrial cybersecurity experts, providing a comprehensive view of the LLMs' effectiveness in secure software development.
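To give a sense of the material in Set 1, the sketch below is an out-of-bounds write (CWE-787), one of the SANS Top 25 entries, written in the style of the demonstrative examples on MITRE's CWE pages. It is a hypothetical illustration, not necessarily one of the 26 snippets the study evaluated.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical CWE-787 (out-of-bounds write) illustration, not one of the
 * study's snippets: the caller-supplied length is never validated against
 * the 64-byte allocation, so memcpy() and the terminator write can land
 * outside the buffer. */
char *copy_input(const char *user, int len)
{
    char *buf = malloc(64);
    if (buf == NULL)
        return NULL;
    memcpy(buf, user, (size_t)len); /* overflows when len > 64 or len is negative */
    buf[len] = '\0';                /* out of bounds when len >= 64 or len < 0 */
    return buf;
}

int main(void)
{
    const char msg[] = "hello";
    char *p = copy_input(msg, (int)sizeof msg);  /* benign here; unchecked callers are not */
    free(p);
    return 0;
}
```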
Mixed Vulnerability Detection
The preliminary study with GPT-3 revealed varying levels of success in identifying software vulnerabilities across five challenges from the Sifu platform. GPT-3 was precise in pinpointing specific vulnerabilities in the first three challenges, though it did not always match the exact CWE categories provided. However, for the fourth and fifth challenges (CWE-190 and CWE-121), GPT-3 failed to identify the primary issues.
In some cases, the model proposed fixes that were inadequate or overly reliant on external libraries such as the open-source OpenSSL library. This indicated that while GPT-3 had some capability in detecting and addressing vulnerabilities, it often struggled with more nuanced or specific issues.
The extended analysis with GPT-4 showed notable improvements, achieving 100% in code understanding and 88% in vulnerability detection for the SANS Top 25 snippets. It identified critical vulnerabilities like out-of-bounds write (CWE-787) and SQL injection (CWE-89) but struggled with more complex ones, such as missing authorization (CWE-862), and specified the correct CWE numbers only 56% of the time.
For the Curated Code Snippet in Set 2, GPT-4's performance was less consistent. It identified only 8 out of 20 issues, missing several critical vulnerabilities, including insecure random number generation and improper exception handling. Despite correctly suggesting specific security measures, like setting secure flags for cookies, GPT-4 failed to address significant problems, such as improper initialization and hardcoded database paths. Additionally, the model exhibited hallucinations by generating false positives and incorrect CWE numbers, highlighting challenges in dealing with more complex and less common vulnerabilities.
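Two of the issue types GPT-4 reportedly missed in the curated set, insecure random number generation and a hardcoded path, can be sketched as follows; the code is hypothetical and not drawn from the paper.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical examples of two curated-set issue types, not code from the
 * paper: predictable randomness and a hardcoded database path. */

/* Insecure random number generation: rand() seeded with the current time is
 * predictable and unsuitable for security-sensitive values such as session
 * tokens; a cryptographically secure generator should be used instead. */
unsigned int generate_session_token(void)
{
    srand((unsigned int)time(NULL));
    return (unsigned int)rand();
}

/* Hardcoded database path: the location is baked into the binary rather than
 * read from configuration, which hinders review, rotation, and deployment. */
static const char *DB_PATH = "/opt/app/data/production.db";

int main(void)
{
    printf("token=%u db=%s\n", generate_session_token(), DB_PATH);
    return 0;
}
```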
Conclusion
To sum up, this study comprehensively evaluated both GPT-3 and GPT-4 models in detecting and mitigating software vulnerabilities. While GPT-4 showed significant improvements over GPT-3, achieving 88% accuracy in identifying common vulnerabilities like those in the SANS Top 25, it continued to face challenges with less frequent or complex vulnerabilities. This highlights the need for careful application of these models alongside traditional security practices and expert oversight.