SYNTHESIS NOTE

Can agents evaluate AI outputs more reliably than language models?

Does active evidence collection through tool use reduce judge inconsistency compared to passive reading-based evaluation? This matters for benchmarking AI systems where evaluation reliability directly affects research validity.

Synthesis note · 2026-02-23 · sourced from Agents Multi

LLM-as-a-Judge evaluates outputs by reading them and scoring. Agent-as-a-Judge evaluates by actively investigating: it collects evidence dynamically through tool use before making a judgment. The reliability gap is large: on complex software engineering tasks with dependencies between requirements, Agent-as-a-Judge shows a judge shift of 0.27% from human consensus, while LLM-as-a-Judge reaches 31.24%.
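This note does not define judge shift. One plausible reading, assumed here purely for illustration, is the absolute gap between the share of requirements a judge marks as satisfied and the share the human consensus marks as satisfied. The function names and verdict lists below are hypothetical and are not data from the source.

```python
def satisfaction_rate(verdicts):
    """Fraction of requirements marked as satisfied (True)."""
    return sum(verdicts) / len(verdicts)


def judge_shift(judge_verdicts, consensus_verdicts):
    """Gap in percentage points between two satisfaction rates.

    ASSUMPTION: this is one plausible definition of "judge shift",
    not necessarily the one used in the underlying study.
    """
    if len(judge_verdicts) != len(consensus_verdicts):
        raise ValueError("verdict lists must cover the same requirements")
    gap = satisfaction_rate(judge_verdicts) - satisfaction_rate(consensus_verdicts)
    return abs(gap) * 100


# Made-up verdicts for eight requirements (True = satisfied).
consensus = [True, False, True, True, False, False, True, False]
lenient_judge = [True, True, True, True, True, False, True, False]
careful_judge = [True, False, True, True, False, False, True, False]

lenient_shift = judge_shift(lenient_judge, consensus)
careful_shift = judge_shift(careful_judge, consensus)
```

Under this reading, a judge that passes requirements too readily drifts away from consensus, while one that matches the consensus rate has zero shift. An aggregate gap like this can hide offsetting per-requirement disagreements, so a per-requirement agreement rate would be a natural companion measure.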

The architecture has eight modular components: (1) a graph module capturing project structure and dependencies, (2) a locate module identifying relevant files, (3) a read module understanding multimodal data across 33 formats, (4) a search module for contextual code understanding, (5) a retrieve module extracting information from long texts, (6) an ask module making pass/fail determinations, (7) a memory module storing historical judgments, and (8) a planning module strategizing next actions.
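The eight components above can be sketched as methods on a single toy class. This is a minimal illustration under heavy assumptions, not the system's actual implementation: the class name, the keyword-matching logic, and the in-memory `files` workspace are all hypothetical, and the `ask` step uses a placeholder rule where a real system would call a model.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    requirement: str
    satisfied: bool
    evidence: list[str]


@dataclass
class ToyAgentJudge:
    files: dict[str, str]  # hypothetical workspace: path -> text
    memory: list[Verdict] = field(default_factory=list)  # (7) memory

    def graph(self) -> dict[str, list[str]]:
        # (1) graph: which files mention which other files
        return {p: [q for q in self.files if q != p and q in text]
                for p, text in self.files.items()}

    def locate(self, keyword: str) -> list[str]:
        # (2) locate: candidate files for a requirement keyword
        return [p for p, text in self.files.items()
                if keyword in p or keyword in text]

    def read(self, path: str) -> str:
        # (3) read: plain text only in this sketch
        return self.files[path]

    def search(self, keyword: str, paths: list[str]) -> list[str]:
        # (4) search: matching lines inside the located files
        return [f"{p}: {line.strip()}" for p in paths
                for line in self.read(p).splitlines() if keyword in line]

    def retrieve(self, hits: list[str], limit: int = 3) -> list[str]:
        # (5) retrieve: keep only a few pieces of evidence
        return hits[:limit]

    def ask(self, requirement: str, evidence: list[str]) -> Verdict:
        # (6) ask: placeholder rule standing in for a model call
        return Verdict(requirement, bool(evidence), evidence)

    def plan(self, requirement: str) -> list[str]:
        # (8) planning: fixed order here; a real planner would adapt
        return ["locate", "search", "retrieve", "ask"]

    def judge(self, requirement: str, keyword: str) -> Verdict:
        paths, hits, verdict = [], [], None
        for step in self.plan(requirement):
            if step == "locate":
                paths = self.locate(keyword)
            elif step == "search":
                hits = self.search(keyword, paths)
            elif step == "retrieve":
                hits = self.retrieve(hits)
            elif step == "ask":
                verdict = self.ask(requirement, hits)
        self.memory.append(verdict)
        return verdict


judge = ToyAgentJudge({
    "train.py": "# depends on model.py\nsave_checkpoint('out/model.pt')\n",
    "model.py": "class Net:\n    pass\n",
})
saved = judge.judge("The training script saves a checkpoint", "save_checkpoint")
logged = judge.judge("Metrics are logged to a CSV file", "csv")
```

The point of the sketch is the shape of the loop: the verdict is produced only after evidence has been located, searched, and trimmed, rather than from a single read of the output. The `graph` method is defined but not consulted in `judge`, which keeps the example short.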

The design mirrors how human evaluators actually work: the human evaluation took 58 hours of initial review followed by 28.5 additional hours of consensus-building debate. The human process itself requires investigation, not just reading. Single-pass evaluation is inadequate for tasks where understanding requires traversing dependencies and cross-referencing evidence.

However, the memory module proved detrimental: errors in previous judgments cascade into current decisions, creating a chain of errors. Historical judgment information was supposed to help assess current requirements but instead propagated mistakes. This is a crucial design finding: agentic evaluation systems need error isolation mechanisms, not just more context.
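The cascade can be illustrated with a toy model. The note does not describe the exact mechanism by which memory propagated errors, so the inheritance rule below (a requirement fails automatically if a remembered dependency failed) is an assumption chosen for illustration, as are all names and the `misjudged` parameter that injects a single wrong verdict.

```python
# Three chained requirements; every artifact really exists in the workspace.
REQUIREMENTS = [
    {"id": "R1", "artifact": "data.csv", "depends_on": []},
    {"id": "R2", "artifact": "model.pt", "depends_on": ["R1"]},
    {"id": "R3", "artifact": "report.md", "depends_on": ["R2"]},
]
WORKSPACE = {"data.csv", "model.pt", "report.md"}


def fresh_check(req, workspace, misjudged):
    """Collect evidence for one requirement; flip the result if misjudged."""
    truth = req["artifact"] in workspace
    return (not truth) if req["id"] in misjudged else truth


def judge_with_memory(reqs, workspace, misjudged):
    """ASSUMED mechanism: trust remembered failures of dependencies."""
    memory = {}
    for req in reqs:
        if any(memory[dep] is False for dep in req["depends_on"]):
            memory[req["id"]] = False  # inherited, never verified
        else:
            memory[req["id"]] = fresh_check(req, workspace, misjudged)
    return memory


def judge_isolated(reqs, workspace, misjudged):
    """Error isolation: each requirement is verified from fresh evidence."""
    return {req["id"]: fresh_check(req, workspace, misjudged) for req in reqs}


# Inject one wrong verdict on R1 and compare how far it spreads.
with_memory = judge_with_memory(REQUIREMENTS, WORKSPACE, misjudged={"R1"})
isolated = judge_isolated(REQUIREMENTS, WORKSPACE, misjudged={"R1"})
```

In this toy setup, one wrong verdict on R1 turns into three wrong verdicts when memory is trusted, but stays confined to R1 when each requirement is checked independently. Isolation trades away whatever benefit accurate history would have provided, so a middle ground (for example, treating remembered verdicts as hints that still need verification) may be worth exploring.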

Since LLM judges can be swayed by superficial cues (see "Can LLM judges be fooled by fake credentials and formatting?"), Agent-as-a-Judge addresses these biases structurally: the agent grounds its judgment in collected evidence rather than relying on heuristic pattern-matching. And since judges can also be attacked from the outside (see "Can LLM judges be tricked without accessing their internals?"), the agentic approach offers a path toward more robust evaluation, but only if the error cascade problem is solved.
