Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities

Aayush Garg; Zanis Ali Khan; Renzo Degiovanni; Qiang Tang

doi:10.1145/3748522.3779743

What is it about?

Every day, new security weaknesses are found in the software we all rely on, and fixing them by hand is slow and expensive. Large Language Models (LLMs), such as GPT-4 and LLaMA, have shown promise in automatically generating code, but can they reliably patch real security vulnerabilities? To find out, we tested 14 popular LLMs on both real-world security flaws and artificially created ones. Rather than simply checking whether the AI-generated code compiled, we ran actual exploit tests (simulated attacks designed to trigger the vulnerability) to see whether the AI's fix truly eliminated the threat. Our results show that LLMs are noticeably better at patching well-known real vulnerabilities than unfamiliar artificial ones. When a real bug is within their reach, multiple models tend to converge on the same correct fix. However, when faced with novel variations of those bugs, their performance drops significantly. No single model was the best at everything, meaning the right choice of AI tool depends on the specific vulnerability being addressed.

Why is it important?

Organizations worldwide are racing to adopt AI tools for software development, including for security-critical tasks. Yet until now, most evaluations of AI-based vulnerability patching relied on superficial metrics, like whether the generated code looked similar to the human fix, rather than rigorously testing whether the exploit was truly neutralized. Our study is among the first to use execution-based Proof-of-Vulnerability tests across a broad set of both real and artificial vulnerabilities. The findings offer practical, evidence-based guidance for developers and security teams choosing which LLMs to trust for patching, and temper expectations about AI's ability to handle unfamiliar security threats.
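
To make the contrast between similarity-based and execution-based evaluation concrete, here is a minimal Python sketch. It is not the study's actual harness: the toy function `read_item`, the two patch strings, and the helpers `similarity`, `load`, and `pov_triggers` are all hypothetical names invented for illustration. It assumes a toy "vulnerability" in which a negative index silently reads from the end of a buffer.

```python
import difflib

# Hypothetical reference fix: rejects both negative and too-large indices.
HUMAN_FIX = '''
def read_item(buf, i):
    if i < 0 or i >= len(buf):
        raise IndexError("out of range")
    return buf[i]
'''

# Hypothetical model-generated patch: looks close to the reference,
# but forgets the negative-index check.
CANDIDATE_PATCH = '''
def read_item(buf, i):
    if i >= len(buf):
        raise IndexError("out of range")
    return buf[i]
'''


def similarity(a, b):
    """Textual similarity in [0, 1]; a stand-in for a superficial metric."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def load(source):
    """Execute a patch's source and return its read_item function."""
    namespace = {}
    exec(source, namespace)
    return namespace["read_item"]


def pov_triggers(read_item):
    """Toy Proof-of-Vulnerability test: True if the exploit still works."""
    try:
        read_item(["public", "secret"], -1)  # attacker-supplied index
    except IndexError:
        return False  # the patch blocked the exploit
    return True  # data was read, so the vulnerability remains


if __name__ == "__main__":
    print("similarity to human fix:", similarity(HUMAN_FIX, CANDIDATE_PATCH))
    print("exploit works on candidate:", pov_triggers(load(CANDIDATE_PATCH)))
    print("exploit works on human fix:", pov_triggers(load(HUMAN_FIX)))
```

In this sketch the candidate patch scores as textually very close to the reference fix, yet the exploit test still succeeds against it. Only running the exploit reveals that the threat remains.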

Perspectives

What struck us most was the gap between how confident LLMs appear when generating code and how fragile that confidence turns out to be when the vulnerability is even slightly unfamiliar. The models can produce fluent, convincing patches, yet our Proof-of-Vulnerability tests repeatedly revealed that the underlying threat remained. This reinforces a lesson we think the broader community needs to internalize: execution-based validation is non-negotiable when AI touches security-critical code. We are optimistic that LLMs will become powerful allies for defenders, but our work shows we are not yet at the point where they can be trusted unsupervised for patching novel threats.
Aayush Garg
Luxembourg Institute of Science and Technology

This page is a summary of: Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities, March 2026, ACM (Association for Computing Machinery), DOI: 10.1145/3748522.3779743.
Contributors

The following have contributed to this page

Aayush Garg
Luxembourg Institute of Science and Technology

Resources

Extended version preprint of Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities

Poster of Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities
