What is it about?
Every day, new security weaknesses are found in the software we all rely on, and fixing them by hand is slow and expensive. Large Language Models (LLMs), such as GPT-4 and LLaMA, have shown promise in automatically generating code, but can they reliably patch real security vulnerabilities? We tested 14 popular LLMs on both real-world security flaws and artificially created ones to find out. Rather than simply checking whether the AI-generated code compiled, we ran actual exploit tests (simulated attacks designed to trigger the vulnerability) to see whether the AI's fix truly eliminated the threat. Our results show that LLMs are noticeably better at patching well-known real vulnerabilities than unfamiliar artificial ones. When a real bug is within their reach, multiple models tend to converge on the same correct fix. However, when faced with novel variations of those bugs, their performance drops significantly. No single model was the best at everything, meaning the right choice of AI tool depends on the specific vulnerability being addressed.
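To make the idea of an exploit test concrete, here is a minimal sketch in Python. It is not the harness used in the study: the path-traversal scenario, the directory name, the function names, and the two attack payloads are all hypothetical, chosen only to show why running the attack says more than checking that a patch compiles or looks reasonable.

```python
import posixpath

BASE_DIR = "/srv/app/files"  # hypothetical directory the app may serve from


def resolve_original(user_path: str) -> str:
    # Vulnerable: "../" segments can climb out of BASE_DIR.
    return posixpath.normpath(posixpath.join(BASE_DIR, user_path))


def resolve_plausible_patch(user_path: str) -> str:
    # Looks like a fix, but a single pass of stripping "../" can be bypassed.
    cleaned = user_path.replace("../", "")
    return posixpath.normpath(posixpath.join(BASE_DIR, cleaned))


def resolve_checked_patch(user_path: str) -> str:
    # Validates the final resolved path instead of filtering the input.
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, user_path))
    if posixpath.commonpath([BASE_DIR, candidate]) != BASE_DIR:
        raise ValueError("path escapes the base directory")
    return candidate


# Hypothetical attack inputs; the second targets naive "../" stripping.
EXPLOITS = ["../../etc/passwd", "....//....//etc/passwd"]


def exploit_succeeds(resolve) -> bool:
    """Return True if any payload resolves to a path outside BASE_DIR."""
    for payload in EXPLOITS:
        try:
            resolved = resolve(payload)
        except ValueError:
            continue  # the patch rejected this payload
        if posixpath.commonpath([BASE_DIR, resolved]) != BASE_DIR:
            return True
    return False


if __name__ == "__main__":
    for fn in (resolve_original, resolve_plausible_patch, resolve_checked_patch):
        status = "still exploitable" if exploit_succeeds(fn) else "exploit blocked"
        print(f"{fn.__name__}: {status}")
```

In this toy setting all three versions run without errors, and the second one reads like a sensible fix, yet only the third stops the attack. That is the kind of difference an execution-based test is meant to expose.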
Why is it important?
Organizations worldwide are racing to adopt AI tools for software development, including for security-critical tasks. Yet until now, most evaluations of AI-based vulnerability patching relied on superficial metrics, like whether the generated code looked similar to the human fix, rather than rigorously testing whether the exploit was truly neutralized. Our study is among the first to use execution-based Proof-of-Vulnerability tests across a broad set of both real and artificial vulnerabilities. The findings offer practical, evidence-based guidance for developers and security teams choosing which LLMs to trust for patching, and temper expectations about AI's ability to handle unfamiliar security threats.
Perspectives
What struck us most was the gap between how confident LLMs appear when generating code and how fragile that confidence turns out to be when the vulnerability is even slightly unfamiliar. The models can produce fluent, convincing patches, yet our Proof-of-Vulnerability tests repeatedly revealed that the underlying threat remained. This reinforces a lesson we think the broader community needs to internalize: execution-based validation is non-negotiable when AI touches security-critical code. We are optimistic that LLMs will become powerful allies for defenders, but our work shows we are not yet at the point where they can be trusted unsupervised for patching novel threats.
Aayush Garg
Luxembourg Institute of Science and Technology
Read the Original
This page is a summary of: Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities, March 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3748522.3779743.
Resources
Extended version preprint of Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities
Poster of Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities