What is it about?

Software bugs are inevitable in development, often causing system failures and consuming vast amounts of time to fix. Automated Program Repair (APR) aims to use Artificial Intelligence to identify and repair these errors automatically. However, many existing AI approaches rely on a "brute force" strategy, generating thousands of potential fixes for a single bug, which is computationally expensive and overwhelming for developers to review. This study investigates a more balanced, developer-friendly approach. Instead of generating thousands of solutions, we restrict the AI to a maximum of 10 attempts per bug. We utilize "instruction-tuned" Large Language Models (like Llama 3 and DeepSeek) and test whether it is better to generate several guesses at once or to use an iterative process where the AI learns from error messages to refine its code. We also explore how much training data is actually needed to make these models effective at fixing bugs.
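
To make the iterative setting concrete, the repair loop under a fixed budget can be pictured roughly as below. This is only an illustrative sketch, not the paper's actual pipeline: the helper names query_model and run_test_suite are hypothetical placeholders for the LLM call and the compile-and-test step, and the prompt wording is invented. In the non-iterative alternative, all 10 candidate patches would instead be sampled from the initial prompt in one go.

```python
# Illustrative sketch only: query_model and run_test_suite are hypothetical
# placeholders, not the paper's actual implementation.

MAX_ATTEMPTS = 10  # strict budget: at most 10 candidate patches per bug


def query_model(prompt: str) -> str:
    """Placeholder for one call to an instruction-tuned LLM (e.g. Llama 3, DeepSeek)."""
    raise NotImplementedError


def run_test_suite(patch: str) -> tuple[bool, str]:
    """Placeholder: apply the patch, compile, run the tests; return (passed, error_log)."""
    raise NotImplementedError


def iterative_repair(buggy_code: str) -> str | None:
    """Try up to MAX_ATTEMPTS patches, feeding error messages back into the prompt."""
    prompt = f"Fix the following buggy function:\n{buggy_code}"
    for _ in range(MAX_ATTEMPTS):
        patch = query_model(prompt)
        passed, error_log = run_test_suite(patch)
        if passed:
            return patch  # a plausible fix found within the budget
        # Iterative refinement: feed the error output back into the next prompt.
        prompt = (
            f"Fix the following buggy function:\n{buggy_code}\n"
            f"A previous attempt failed with this error:\n{error_log}\n"
            f"Provide a corrected version."
        )
    return None  # budget exhausted without a plausible patch
```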

Why is it important?

This work is unique because it prioritizes practical usability and resource efficiency over raw repair counts. By limiting the AI to a strict budget of 10 patches per bug, we simulate the real-world constraints of developers who cannot sift through endless suggestions. Our findings challenge the prevailing belief that "more data is always better": we demonstrate that fine-tuning models on a very small dataset (less than 1% of the available data) can yield performance improvements of up to 78%, whereas larger datasets can actually lead to overfitting and worse results. Furthermore, we identify a crucial trade-off: while specialized (fine-tuned) models are good at fixing simple bugs quickly, general (base) models are better at incorporating feedback and fixing complex problems over successive attempts.

Perspectives

Writing this paper highlighted how the principle of "less is more" applies to Large Language Models in software engineering. We were surprised to find that feeding the models massive amounts of training data often hurt their ability to think flexibly, causing them to memorize patterns rather than reasoning through errors. It was equally intriguing to observe the distinct "personalities" of the models: base models acted like students who improved significantly when told why their code failed, while fine-tuned models behaved like rigid experts—they either solved the problem immediately or struggled to adapt. We hope this work encourages the community to move away from simply scaling up data and towards designing smarter, iterative interactions between humans, AI, and compiler feedback.

Fernando Vallecillos Ruiz
Simula Research Laboratory

Read the Original

This page is a summary of: The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models, June 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3756681.3756966.