What is it about?

This survey examines the benchmarks used to evaluate LLMs on software engineering tasks such as code generation and program repair. It explores how these datasets were created and how their quality is ensured, checks how they handle and mitigate data contamination, reviews the evaluation pipelines and metrics used to test LLMs on these tasks, and offers future directions for building better benchmark datasets.

Why is it important?

This survey provides an evidence-based critique of LLM benchmark datasets in software engineering, exposing hidden flaws such as data contamination. It also helps researchers choose the most suitable benchmarks for their specific software engineering tasks and offers an actionable roadmap for creating better, real-world benchmarks.

Read the Original

This page is a summary of: Surveying the Benchmarking Landscape of Large Language Models in Code Intelligence, ACM Transactions on Software Engineering and Methodology, March 2026, ACM (Association for Computing Machinery),
DOI: 10.1145/3800957.
