What is it about?
This work surveys benchmarks used to evaluate LLMs on software engineering tasks (e.g., code generation, program repair). It explores how these datasets were created and how their quality is ensured, examines how they handle and mitigate data contamination, reviews the evaluation pipelines and metrics used to test LLMs on these tasks, and offers future directions for creating better benchmark datasets.
Why is it important?
This survey provides an evidence-based critique of LLM benchmark datasets in software engineering, exposing hidden flaws such as data contamination. It also helps researchers choose the most suitable benchmarks for their specific software engineering tasks and offers an actionable roadmap for creating better, real-world benchmarks.
Read the Original
This page is a summary of: Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence, ACM Transactions on Software Engineering and Methodology, March 2026, ACM (Association for Computing Machinery).
DOI: 10.1145/3800957.