Agent-as-a-Judge: Evaluate Agents with Agents

Paper · arXiv 2410.10934 · Published October 14, 2024

Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes, ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It comes with rich manual annotations, including a total of 365 hierarchical user requirements. We benchmark three popular agentic systems using Agent-as-a-Judge and find that it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems by providing the rich and reliable reward signals necessary for dynamic and scalable self-improvement.

Introduction. Recent years have seen multimodal agentic systems move from occasionally being able to solve small toy problems to being regularly deployed for challenging real-world problems (the dream of most AI research). Yet current evaluation methods and available benchmarks for agentic systems are struggling to keep up with these rapid advances, dramatically slowing true progress. We believe that the current issue with evaluating agentic systems stems from the lack of feedback during the intermediate task-solving stages for these nontraditional systems. Agentic systems think more like humans: they often act step by step (Wooldridge, 1999) and often host very human-like symbolic communications internally to solve problems (Zhuge et al., 2023). Agentic systems should therefore be evaluated like a human, with rich evaluative feedback that looks at the full thought and action trajectory; evaluating an agentic system in the traditional way is like evaluating a student using multiple-choice testing, a comparatively unreliable estimator (Park, 2010).

Discussion / Conclusion. Outlook 1: Intermediate Feedback for Agentic Self-Improvement. A key power of Agent-as-a-Judge, though not fully exploited here, is that it provides intermediate feedback, which is essential for effective and efficient optimization (Zhuge et al., 2024; Pan et al., 2024). For example, Agarwal et al. (2019) propose to solve the sparse reward problem in reinforcement learning by learning auxiliary reward functions that provide intermediate feedback. Perhaps the greatest strength of the Agent-as-a-Judge framework is that an agentic system can use it to identify and fix issues in its solutions to complex, multistage problems on the fly, something older, delayed-feedback methods did not permit. By introducing Agent-as-a-Judge, we create the opportunity to build a process-supervised reward model (PRM) for improving agentic systems (Lightman et al., 2023).
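
The contrast between a delayed, outcome-only signal and intermediate feedback can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Requirement` class, the `judge` function, the `evidence` dictionary, and the example requirements are all hypothetical, and modelling "hierarchical" requirements as prerequisite lists is an assumption made here for illustration. A real judge would inspect the agent's workspace and trajectory rather than read precomputed booleans.

```python
from dataclasses import dataclass, field


@dataclass
class Requirement:
    """One user requirement; `depends_on` lists prerequisite requirement ids.

    Hypothetical structure, not the DevAI annotation format.
    """
    rid: str
    text: str
    depends_on: list[str] = field(default_factory=list)


def judge(requirements, evidence):
    """Return a per-requirement verdict (hypothetical stand-in for a judge).

    `evidence` maps requirement id -> bool, pretending the judge has already
    inspected the workspace. A requirement counts as satisfied only when its
    own check passes and all of its prerequisites are satisfied.
    """
    verdicts = {}
    for req in requirements:  # assumes prerequisites are listed first
        prereqs_ok = all(verdicts.get(dep, False) for dep in req.depends_on)
        verdicts[req.rid] = prereqs_ok and evidence.get(req.rid, False)
    return verdicts


def outcome_only_signal(verdicts):
    """Sparse signal: 1.0 only if every requirement is met."""
    return 1.0 if all(verdicts.values()) else 0.0


def intermediate_signal(verdicts):
    """Denser signal: fraction of requirements met, plus the unmet ids."""
    unmet = [rid for rid, ok in verdicts.items() if not ok]
    score = 1.0 - len(unmet) / len(verdicts)
    return score, unmet


# Made-up example task with four requirements.
requirements = [
    Requirement("R0", "Load the dataset"),
    Requirement("R1", "Train the model", depends_on=["R0"]),
    Requirement("R2", "Save the metrics", depends_on=["R1"]),
    Requirement("R3", "Plot the training curve", depends_on=["R1"]),
]
evidence = {"R0": True, "R1": True, "R2": False, "R3": True}

verdicts = judge(requirements, evidence)
print(outcome_only_signal(verdicts))      # 0.0
score, unmet = intermediate_signal(verdicts)
print(score, unmet)                       # 0.75 ['R2']
```

In this toy case the outcome-only signal reports failure without saying why, whereas the per-requirement verdicts point to the single unmet requirement, which is the kind of information an agent could act on mid-task.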

