What is it about?
AI agents powered by large language models (LLMs)—like those used in chatbots or automated assistants—are becoming common in everyday tools and business workflows. But there is still little clarity on how to test how well they actually perform. Our survey offers a practical guide for evaluating these agents: what to measure (such as accuracy, reliability, and safety) and how to measure it (with which tools, data, and methods). We also highlight real-world challenges, such as privacy rules and the need for consistent behavior over time, making this work especially relevant for companies deploying AI systems in practice.
Why is it important?
As AI agents powered by large language models (LLMs) are increasingly used in real-world applications, evaluating them effectively has become a pressing challenge. In working with these systems, we found that evaluation goes far beyond simple accuracy: it involves understanding how agents behave over time, how they interact with tools, and whether they comply with enterprise requirements such as data access and security. Yet the landscape of research and practice is fragmented, with many competing tools, benchmarks, and evaluation methods. This paper offers a framework to organize that complexity. By bringing structure and clarity to a rapidly evolving field, it supports both researchers and practitioners in building more trustworthy, safe, and scalable agentic AI systems.
Perspectives
This work grew out of both curiosity and necessity. While reading many agent papers and working on LLM-based agents in real-world projects, I found an ecosystem full of tools, concepts, and benchmarks with little guidance on how they all fit together. I felt a strong need to organize this space, both to better understand it myself and to create a mental map that connects research efforts with practical deployment. This paper reflects our effort to bring structure and clarity to an area that’s still rapidly evolving. I hope it helps others—whether researchers, engineers, or product teams—build more reliable and useful AI agents.
Mahmoud Mohammadi
SAP
Read the Original
This page is a summary of: Evaluation and Benchmarking of LLM Agents: A Survey, August 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3711896.3736570.
Contributors
The following have contributed to this page:
Mahmoud Mohammadi, SAP