Survey on Evaluation of LLM-based Agents
The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey thus maps the rapidly evolving landscape of agent evaluation, identifies current limitations, and proposes directions for future research.
Introduction. Recent years saw a huge leap in the ability of Large Language Models (LLMs) to address a wide range of challenging tasks. Yet, LLMs are static models that are restricted to single-turn, text-to-text interactions. LLM-based agents, hereinafter also referred to as LLM agents, take the power of LLMs a step further by integrating them into a multi-step flow, while maintaining a state that is shared by multiple LLM calls, providing context and consistency. They also utilize external tools to perform computations, access external knowledge and interact with their environment. Agents are able to autonomously conceive, execute and adapt complex plans in real-world environments. This newfound agency empowers them to address problems previously beyond the reach of AI, paving the way for innovative applications across a wide spectrum of domains. Reliable evaluation of agents is critical to ensure their efficacy in real-world applications, and to guide further progress in this rapidly evolving field.
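As a concrete illustration of the agent pattern described above (a multi-step flow, a state shared across model calls, and external tools), the following is a minimal Python sketch. It is not taken from any surveyed system: the names AgentState, stub_model, TOOLS, and run_agent are hypothetical, the "call"/"finish" action format is an assumption made for this example, and the model is a hard-coded stub standing in for a real LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """State shared by every model call within one episode (hypothetical)."""
    goal: str
    history: List[str] = field(default_factory=list)  # tool observations so far


def stub_model(state: AgentState) -> str:
    """Hard-coded stand-in for an LLM call; returns the next action as text."""
    if not state.history:
        return "call add 2 3"               # first step: request a tool call
    return "finish " + state.history[-1]    # then: answer with the last observation


# Hypothetical tool registry: tool name -> function over string arguments.
TOOLS: Dict[str, Callable[[List[str]], str]] = {
    "add": lambda args: str(sum(int(a) for a in args)),
}


def run_agent(goal: str,
              model: Callable[[AgentState], str] = stub_model,
              max_steps: int = 5) -> str:
    """Run the multi-step loop until the model finishes or the budget runs out."""
    state = AgentState(goal=goal)
    for _ in range(max_steps):
        action = model(state).split()
        if action and action[0] == "finish":
            return " ".join(action[1:])
        if len(action) >= 2 and action[0] == "call" and action[1] in TOOLS:
            observation = TOOLS[action[1]](action[2:])
        else:
            observation = "error: unknown action"
        # The observation becomes part of the shared state seen by the next call.
        state.history.append(observation)
    return "error: step budget exhausted"
```

Under these assumptions, run_agent("add 2 and 3") takes two model calls: the first requests the add tool and the second returns the tool's result as the final answer. The step budget is one simple way to bound an episode, and an evaluation harness would need some comparable limit when scoring agents on multi-step tasks.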
Discussion / Conclusion. Our review of the evolution of agent benchmarking highlights several converging trends shaping the field. We identify two primary motives exhibited in the development of new evaluation methodologies, which we outline in the subsequent discussion. Realistic and Challenging Evaluation. Early agent evaluations often relied on simplified, static environments. However, there is a clear shift toward benchmarks that more accurately reflect real-world complexities. In web agent evaluation, for example, the field has moved from basic simulations like MiniWoB to dynamic online environments like WebArena and VisualWebArena. For scientific agents, it has likewise moved from LAB-Bench, which is static and narrow, to DiscoveryWorld. In software engineering, SWE-bench utilizes real-world GitHub issues, moving beyond synthetic coding problems. This shift toward realism is key to evaluating agents in real-world scenarios, as it captures interaction nuances missed by simpler benchmarks.