ChatGPT Improves Software Security But Struggles With Complex Vulnerabilities

As large language models like GPT-4 show potential in boosting software security, researchers warn that human oversight remains crucial as AI struggles with complex, less common vulnerabilities.


In a paper published in the journal Information, researchers in Germany and Portugal examined the potential of chat generative pre-trained transformer (ChatGPT) models to aid in secure software development. Drawing on industry experience and previous work, they conducted two experiments comparing the GPT-3 and GPT-4 large language models (LLMs). The study also emphasized the importance of international standards such as IEC 62443 and ISO/IEC 27001 for secure software development, used alongside frameworks like MITRE's CWE and the OWASP Top 10 to evaluate AI's role in enhancing security.

The study explored the specific advantages, challenges, and limitations of using ChatGPT models. This investigation built on the success of the CyberSecurity Challenges game, which raised awareness of secure coding practices.

Related Work

Past work explored using artificial intelligence (AI) in secure software development, examining standards such as International Electrotechnical Commission 62443 (IEC 62443) and ISO/IEC 27001 to assess AI's role in enhancing software security. These standards are critical in industrial automation and control systems (IACSs), providing guidelines for secure lifecycle management, although they are not yet fully adapted to the capabilities and limitations of generative AI.

Researchers investigated frameworks such as MITRE's Common Weakness Enumeration (CWE) and the Open Worldwide Application Security Project's (OWASP) Top 10, noting the potential of models like ChatGPT and Meta's LLaMA to assist developers. Studies highlighted both the promise and the challenges of AI tools, including vulnerability prediction with LineVul and reports that AI code assistants like GitHub Copilot can propagate insecure coding practices, raising concerns about the spread of AI-generated vulnerabilities.

LLMs in Vulnerability Assessment

The study explored how LLMs such as GPT-3 can identify and mitigate software vulnerabilities. Using the ChatGPT interface, the researchers tested five C/C++ challenges from the Sifu platform, covering vulnerabilities such as buffer overflows and integer overflows. These interactions aimed to assess the model's capability to detect vulnerabilities and suggest effective solutions.
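The article does not reproduce the challenge code, but the flavor of such an exercise can be sketched. The snippet below is a generic, assumed illustration of a stack-based buffer overflow (CWE-121), not an actual Sifu challenge:

    #include <stdio.h>
    #include <string.h>

    /* CWE-121: stack-based buffer overflow (illustrative sketch only,
     * not taken from the Sifu platform). strcpy() performs no bounds
     * check, so input longer than 15 characters overruns buf. */
    void greet(const char *name) {
        char buf[16];
        strcpy(buf, name);                      /* vulnerable: unbounded copy */
        printf("Hello, %s\n", buf);
    }

    /* Safer variant: bound the copy to the buffer size. */
    void greet_safe(const char *name) {
        char buf[16];
        snprintf(buf, sizeof buf, "%s", name);  /* truncates, never overruns */
        printf("Hello, %s\n", buf);
    }

    int main(void) {
        greet_safe("a deliberately long input that would overflow greet()");
        return 0;
    }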

Interactions with GPT-3 followed a fixed series of questions: the model was asked to identify the vulnerability, provide the associated CWE identifier, and propose a fix. GPT-3 successfully identified common vulnerabilities such as buffer overflows (CWE-121) but failed to detect more complex ones such as integer overflow (CWE-190), which is especially crucial in industrial settings. In one of the challenges, a code snippet contained a side-channel leakage vulnerability in which the function's runtime depended on its inputs, potentially revealing sensitive information.
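The paper's exact snippet is not shown here, but a timing side channel of this kind typically arises from a comparison that exits on the first mismatching byte. The hypothetical sketch below contrasts it with a constant-time version:

    #include <stddef.h>

    /* Illustrative timing side channel (assumed example, not the study's
     * snippet): the early return makes runtime grow with the number of
     * matching leading bytes, leaking information about the secret. */
    int check_token_leaky(const unsigned char *secret,
                          const unsigned char *guess, size_t n) {
        for (size_t i = 0; i < n; i++) {
            if (secret[i] != guess[i])
                return 0;            /* early exit: runtime depends on input */
        }
        return 1;
    }

    /* Constant-time variant: always scans all n bytes and accumulates
     * differences, so runtime no longer reveals where the bytes differ. */
    int check_token_ct(const unsigned char *secret,
                       const unsigned char *guess, size_t n) {
        unsigned char diff = 0;
        for (size_t i = 0; i < n; i++)
            diff |= (unsigned char)(secret[i] ^ guess[i]);
        return diff == 0;
    }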

The preliminary study involved 43 interactions and covered five specific vulnerabilities. The results provided insights into GPT-3's strengths and weaknesses in recognizing rarer or more context-specific vulnerabilities. This weakness also reflects a deeper theoretical limit: some secure-coding problems are undecidable, meaning no algorithmic method, AI-based or otherwise, can fully resolve them.

Building on the preliminary study, an extended analysis was performed using the more advanced GPT-4 model. This phase assessed not only improvements in detection but also the model's ability to suggest practical fixes. The study included two sets of code snippets: Set 1, consisting of vulnerabilities from the SysAdmin, Audit, Network, and Security (SANS) Top 25, and Set 2, a curated collection developed to test the model's generalization across various security issues.

The SANS Top 25 snippets came from the MITRE website, while the curated snippets were manually analyzed for multiple vulnerabilities. The curated set was designed to test GPT-4's ability to handle diverse, real-world scenarios. Using 26 snippets in total, the study employed structured prompts to evaluate GPT-4's performance in code understanding, vulnerability detection, and fix suggestion, focusing on correctness, completeness, and relevance.

This involved assessing whether the LLM correctly identified and described vulnerabilities, whether it found all relevant security issues, and whether the suggested fixes would be applicable in an industrial environment. The results were evaluated through manual analysis and discussions with industrial cybersecurity experts, providing a comprehensive view of the LLMs' effectiveness in secure software development.

Mixed Vulnerability Detection

The preliminary study with GPT-3 revealed varying levels of success in identifying software vulnerabilities across five challenges from the Sifu platform. GPT-3 was precise in pinpointing specific vulnerabilities in the first three challenges, though it did not always match the exact CWE categories provided. However, for the fourth and fifth challenges (CWE-190 and CWE-121), GPT-3 failed to identify the primary issues.
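One reason integer overflow (CWE-190) is easy for a model to miss is that the offending arithmetic looks innocuous. The hypothetical allocation routine below, which is not one of the five challenges, shows how a multiplication can wrap before it ever reaches malloc:

    #include <stdint.h>
    #include <stdlib.h>

    /* CWE-190: integer overflow (hypothetical example, not a Sifu
     * challenge). If count * size wraps around, malloc allocates a
     * buffer far smaller than the caller later writes into. */
    void *alloc_records_unsafe(size_t count, size_t size) {
        return malloc(count * size);          /* may wrap silently */
    }

    /* Checked variant: reject the request if the product would wrap. */
    void *alloc_records_safe(size_t count, size_t size) {
        if (size != 0 && count > SIZE_MAX / size)
            return NULL;                      /* overflow would occur */
        return malloc(count * size);
    }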

In some cases, the model proposed fixes that were inadequate or relied too heavily on external libraries such as the open-source OpenSSL library. This indicated that while GPT-3 had some capability in detecting and addressing vulnerabilities, it often struggled with more nuanced or specific issues.

The extended analysis with GPT-4 showed notable improvements: the model achieved 100% in code understanding and 88% in vulnerability detection on the SANS Top 25 snippets. It identified critical vulnerabilities such as out-of-bounds writes (CWE-787) and SQL injection (CWE-89) but struggled with more complex ones, such as missing authorization (CWE-862), and specified the correct CWE number just 56% of the time.
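GPT-4's success on SQL injection is plausible because the flaw follows a highly recognizable pattern: query text assembled by string formatting. The sketch below, an assumed example using SQLite's C API and a hypothetical users table, contrasts the vulnerable and parameterized forms:

    #include <stdio.h>
    #include <sqlite3.h>

    /* CWE-89: SQL injection (illustrative sketch; assumes a `users`
     * table exists). Concatenating untrusted input into the query lets
     * an attacker inject SQL, e.g. name = "x' OR '1'='1". */
    int find_user_unsafe(sqlite3 *db, const char *name) {
        char sql[256];
        snprintf(sql, sizeof sql,
                 "SELECT id FROM users WHERE name = '%s';", name);
        return sqlite3_exec(db, sql, NULL, NULL, NULL);
    }

    /* Parameterized variant: the input is bound as data, never parsed
     * as SQL, which removes the injection vector. */
    int find_user_safe(sqlite3 *db, const char *name) {
        sqlite3_stmt *stmt;
        int rc = sqlite3_prepare_v2(db,
            "SELECT id FROM users WHERE name = ?;", -1, &stmt, NULL);
        if (rc != SQLITE_OK) return rc;
        sqlite3_bind_text(stmt, 1, name, -1, SQLITE_TRANSIENT);
        while (sqlite3_step(stmt) == SQLITE_ROW) { /* consume rows */ }
        return sqlite3_finalize(stmt);
    }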

For the curated snippets in Set 2, GPT-4's performance was less consistent. It identified only 8 of 20 issues, missing several critical vulnerabilities, including insecure random number generation and improper exception handling. Although it correctly suggested specific security measures, such as setting secure flags for cookies, GPT-4 failed to address significant problems such as improper initialization and hardcoded database paths. The model also exhibited hallucinations, generating false positives and incorrect CWE numbers, highlighting its difficulty with more complex and less common vulnerabilities.
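The insecure-randomness miss is notable because the fix is largely mechanical. The following assumed example, not taken from the paper's curated set, shows a predictable token built from rand() next to one drawn from the operating system's CSPRNG on POSIX systems:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Predictable randomness (assumed example, not from the paper's
     * set): rand() seeded with the current time can be reconstructed
     * by an attacker who knows roughly when the token was generated. */
    unsigned int weak_token(void) {
        srand((unsigned int)time(NULL));   /* low-entropy, guessable seed */
        return (unsigned int)rand();
    }

    /* Safer variant: draw bytes from the OS CSPRNG (/dev/urandom on
     * POSIX systems) instead of a seeded pseudo-random generator. */
    int strong_token(unsigned char *buf, size_t len) {
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f) return -1;
        size_t got = fread(buf, 1, len, f);
        fclose(f);
        return got == len ? 0 : -1;
    }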

Conclusion

To sum up, this study comprehensively evaluated both GPT-3 and GPT-4 models in detecting and mitigating software vulnerabilities. While GPT-4 showed significant improvements over GPT-3, achieving 88% accuracy in identifying common vulnerabilities like those in the SANS Top 25, it continued to face challenges with less frequent or complex vulnerabilities. This highlights the need for careful application of these models alongside traditional security practices and expert oversight.

Journal reference:
  • Espinha Gasiba, et al. (2024). May the Source Be with You: On ChatGPT, Cybersecurity, and Secure Coding. Information, 15(9), 572. DOI: 10.3390/info15090572, https://www.mdpi.com/2078-2489/15/9/572

Written by

Silpaja Chandrasekar

Dr. Silpaja Chandrasekar has a Ph.D. in Computer Science from Anna University, Chennai. Her research expertise lies in analyzing traffic parameters under challenging environmental conditions. Additionally, she has gained valuable exposure to diverse research areas, such as detection, tracking, classification, medical image analysis, cancer cell detection, chemistry, and Hamiltonian walks.

