What is it about?
AI tools can explain homework problems and even "tutor" students, but tutoring is not the same as answering. A helpful tutor notices patterns in how students misunderstand a topic and gives feedback that targets those misunderstandings. Without good tests, AI tutors can sound persuasive while still missing what the student actually got wrong. We are building a benchmark to test whether AI tutors and graders can recognize why a student is wrong, not just produce an answer. We generate realistic wrong answers based on known misconceptions and validate the approach using existing labeled data, with the long-term goal of grounding the benchmark in de-identified student work from real classrooms.
Why is it important?
AI tutors and graders are arriving faster than our ability to evaluate them. Many systems already produce fluent explanations, but fluency is not the same as learning support. Our work proposes a benchmark centered on misconception diagnosis: the ability to identify why a student is wrong and to respond appropriately. Unlike evaluations that focus mainly on answer correctness, we test whether models can match common misconceptions to student-like errors. We also address the data bottleneck with a semi-synthetic approach that is designed to scale while remaining grounded in real student work when such data are available, enabling faster iteration and more trustworthy adoption of these tools.
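To make the evaluation idea concrete, the sketch below shows one way such a check could look, assuming a simple multiple-choice style item: a student-like wrong answer is generated from a known misconception, and a model earns credit only if it names that misconception rather than just the correct answer. The `BenchmarkItem` schema, the `diagnose` stand-in, and the example content are hypothetical illustrations, not the benchmark's actual format or code.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    """One illustrative item; all field names here are hypothetical."""
    question: str
    student_answer: str             # synthetic, student-like wrong answer
    target_misconception: str       # misconception used to generate the answer
    candidate_misconceptions: list  # labels the evaluated model chooses from


def score_diagnosis(item: BenchmarkItem, predicted: str) -> bool:
    """Credit the model only if it names WHY the student is wrong."""
    return predicted == item.target_misconception


# Toy theoretical-computer-science example; content is made up for illustration.
item = BenchmarkItem(
    question="Is the language { a^n b^n : n >= 0 } regular?",
    student_answer="Yes, the regular expression a*b* matches it.",
    target_misconception="Confuses a^n b^n (equal counts) with a*b* (any counts)",
    candidate_misconceptions=[
        "Confuses a^n b^n (equal counts) with a*b* (any counts)",
        "Thinks finite automata can count without bound",
        "Believes every context-free language is regular",
    ],
)


def diagnose(item: BenchmarkItem) -> str:
    """Stand-in for the AI tutor or grader under evaluation (placeholder)."""
    return item.candidate_misconceptions[0]


print(score_diagnosis(item, diagnose(item)))  # True when the misconception is matched
```

The point of scoring the misconception label, rather than the answer itself, is what separates this kind of test from a correctness-only evaluation.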
Read the Original
This page is a summary of: LLMTutorBench: A Benchmark for University-level TCS AI Tutoring Systems, February 2026, ACM (Association for Computing Machinery), DOI: 10.1145/3770761.3777018.