How should we evaluate agent behavior beyond final answers?
Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
If evaluation is a map E: X → Y from admissible evidence to judgments, then the shift to agentic systems changes both the evidence and the procedure, in a parallel way that recurs across benchmarks. On the evidence side (X), the unit expands from a single final response to a full interaction-generated trajectory: the sequence of states, actions, tool calls, and environment responses produced as the system acts in closed loop. On the procedure side (E), final correctness is no longer sufficient; the evaluator must additionally score process quality, recoverability (can the agent get back on track after an error?), coordination (across tools, environments, and other agents), robustness, efficiency, and system-level performance.
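The contrast between endpoint evidence and trajectory evidence can be made concrete with a small sketch. Everything here is illustrative: the `Step` and `Trajectory` types, the `error` flag, and the two scoring functions are hypothetical names chosen for this example, not part of any benchmark, and the recovery check is a deliberately crude stand-in for a real recoverability score.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    """One closed-loop step: what the agent did and what came back."""
    action: str            # e.g. a tool name or a message
    observation: str       # environment or tool response
    error: bool = False    # hypothetical flag: did this step fail?


@dataclass
class Trajectory:
    """The evidence unit X: the whole interaction, not just the endpoint."""
    steps: list[Step] = field(default_factory=list)
    final_answer: Optional[str] = None


def score_endpoint(traj: Trajectory, expected: str) -> dict:
    """Endpoint-only evaluation: reads the final answer and nothing else."""
    return {"final_correct": traj.final_answer == expected}


def score_trajectory(traj: Trajectory, expected: str) -> dict:
    """Trajectory-level evaluation: endpoint plus qualities that exist over time."""
    error_indices = [i for i, s in enumerate(traj.steps) if s.error]
    # Assumed definition: the agent "recovered" if every failed step
    # is followed by at least one later step that did not fail.
    recovered = all(
        any(not later.error for later in traj.steps[i + 1:])
        for i in error_indices
    )
    return {
        "final_correct": traj.final_answer == expected,
        "num_steps": len(traj.steps),      # crude efficiency proxy
        "num_errors": len(error_indices),
        # None means there was no error to recover from.
        "recovered": recovered if error_indices else None,
    }


example = Trajectory(
    steps=[
        Step("search", "timeout", error=True),
        Step("search", "3 results"),
        Step("answer", "done"),
    ],
    final_answer="Paris",
)
endpoint_view = score_endpoint(example, "Paris")
trajectory_view = score_trajectory(example, "Paris")
```

Both functions agree that the final answer is correct, but only the trajectory-level view records that the agent hit a failed tool call and got back on track. That information does not exist in the endpoint evidence at all.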
This is a pattern, not a single metric, because the same expansion recurs across otherwise unrelated agent benchmarks. T-Eval scores whether each predicted tool call matches the expected one; AgentBoard's Progress Rate compares the actual trajectory against the expected trajectory; multi-agent frameworks score collaborative efficiency and how well agents distribute tasks dynamically. Each is an instance of "stop scoring the endpoint, start scoring the path." The trajectory becomes the evidence, and the qualities that only exist over time — recovery, coordination, partial progress — become the things judged.
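Two of the path-scoring ideas above can be sketched in a few lines. These are simplified illustrations in the spirit of per-call matching and progress-against-an-expected-trajectory, not the actual metric definitions used by T-Eval or AgentBoard; the function names, the position-wise comparison, and the ordered-milestone rule are all assumptions made for this example.

```python
def tool_call_match_rate(predicted: list, expected: list) -> float:
    """Fraction of positions where the predicted tool call equals the expected one.

    Calls are compared position by position; missing or extra calls count
    against the score because we divide by the longer sequence.
    """
    if not predicted and not expected:
        return 1.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / max(len(predicted), len(expected))


def progress_rate(actual: list, expected: list) -> float:
    """Fraction of expected milestones reached, in order, along the actual path.

    Detours and repeated states are allowed; only the order of the
    expected milestones matters.
    """
    if not expected:
        return 1.0
    reached = 0
    for state in actual:
        if reached < len(expected) and state == expected[reached]:
            reached += 1
    return reached / len(expected)


predicted_calls = [("search", {"q": "x"}), ("open", {"id": 1})]
expected_calls = [("search", {"q": "x"}), ("open", {"id": 2}), ("read", {})]
match = tool_call_match_rate(predicted_calls, expected_calls)

actual_states = ["home", "search", "home", "results"]
expected_states = ["home", "results", "checkout"]
progress = progress_rate(actual_states, expected_states)
```

In this example the agent never completes the task, so an endpoint-only score would be zero. The path-level scores instead credit one matching tool call out of three and two milestones out of three, which is the kind of partial-progress evidence the trajectory makes available.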
Why it matters: this reframes a scattered set of agent metrics as one coherent move. Once you see the scoring of process, recoverability, and coordination as the trajectory-level analogue of final-answer scoring, you can ask the design-science questions (which artifacts to admit, how to map them to judgments) systematically rather than benchmark by benchmark. The counterpoint: richer evidence is also noisier and harder to standardize, which is precisely why the expansion creates new evaluation challenges rather than dissolving the old ones.
Inquiring lines that use this note as a source (12)
- Can agent success reports serve as reliable oversight signals in real deployment?
- What specific failure modes must evaluation catch before deploying action-capable systems?
- Can automated evaluation replace human judgment in agent testing?
- What evaluation criteria can hold across legitimate adoption and coercion?
- Why do agents report success when actions actually fail?
- What separates good workflow design from poor workflow design?
- What makes some agent benchmarks measure interaction quality better than others?
- Where does agent reliability come from if not better tools?
- What five ecosystem conditions must coordination governance and evidence actually satisfy?
- What role does runtime feedback play in agent verification and progress confirmation?
- How does evaluating interaction trajectories change what we measure beyond correctness?
- What governance and safety measurements matter for deployed agent environments?
Related concepts in this collection (3)
- Should interactive evaluation be designed as a unified paradigm?
  As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
  the paradigm whose evidence-and-procedure expansion this pattern describes concretely
- Can trajectory structure replace hand-annotated process rewards?
  Recent methods extract step-level supervision directly from how agent trajectories are structured (trees, expert alignments, tool calls) rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
  operationalizes trajectory-as-evidence for training, complementing trajectory-as-evidence for evaluation
- Does agent interaction time scale separately from reasoning depth?
  Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.
  the capability side: interaction-horizon abilities are exactly what trajectory-level evaluation is needed to measure
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Agent-as-a-Judge: Evaluate Agents with Agents
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- Survey on Evaluation of LLM-based Agents
- Evaluation and Benchmarking of LLM Agents: A Survey
- Why Do Multi-agent LLM Systems Fail?
- Towards a Science of Scaling Agent Systems
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
- Interactive Evaluation Requires a Design Science
Original note title
agent evaluation expands evidence from final responses to interaction trajectories scoring process recoverability and coordination