SYNTHESIS NOTE

How should we evaluate agent behavior beyond final answers?

Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?

Synthesis note · 2026-05-28 · sourced from Evaluations

If evaluation is the map E: X → Y from admissible evidence to judgments, then the shift to agentic systems changes both terms in parallel, and the same change recurs across benchmarks. On the evidence side (X), the unit expands from a single final response to a full interaction-generated trajectory: the sequence of states, actions, tool calls, and environment responses produced as the system acts in closed loop. On the procedure side (E), final correctness is no longer sufficient; the evaluator must additionally score process quality, recoverability (can the agent get back on track after an error?), coordination (across tools, environments, and other agents), robustness, efficiency, and system-level performance.
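The contrast between endpoint evidence and trajectory evidence can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Step` and `Trajectory` types, the field names, and the recovery rule (an errored step counts as recovered if any later step succeeds) are all assumptions introduced here for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Step:
    action: str              # hypothetical label, e.g. "call_tool:search"
    observation: str         # what the environment returned
    is_error: bool = False   # assumed to be marked by the harness


@dataclass
class Trajectory:
    steps: List[Step]
    final_answer: Optional[str]


def endpoint_eval(traj: Trajectory, expected: str) -> Dict[str, float]:
    """E over the narrow X: only the final response is admitted as evidence."""
    return {"final_correct": float(traj.final_answer == expected)}


def trajectory_eval(traj: Trajectory, expected: str) -> Dict[str, float]:
    """E over the wide X: the whole trajectory is admitted as evidence."""
    n = len(traj.steps)
    errors = [i for i, s in enumerate(traj.steps) if s.is_error]
    # Assumed recovery rule: an error is recovered if a later step succeeds.
    recovered = sum(
        1 for i in errors if any(not s.is_error for s in traj.steps[i + 1:])
    )
    return {
        "final_correct": float(traj.final_answer == expected),
        "error_rate": len(errors) / n if n else 0.0,
        "recovery_rate": recovered / len(errors) if errors else 1.0,
        "steps_used": float(n),  # crude efficiency proxy
    }


recovering = Trajectory(
    steps=[
        Step("call_tool:search", "3 results"),
        Step("call_tool:open", "404", is_error=True),
        Step("call_tool:open", "page text"),
    ],
    final_answer="42",
)
stuck = Trajectory(
    steps=[
        Step("call_tool:search", "3 results"),
        Step("call_tool:open", "404", is_error=True),
    ],
    final_answer=None,
)
```

Under these assumptions, `endpoint_eval` sees only that `recovering` got the answer right, while `trajectory_eval` also reports that it hit one error and recovered from it; `stuck` scores zero on both correctness and recovery.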

This is a pattern, not a single metric, because the same expansion recurs across otherwise unrelated agent benchmarks. T-Eval scores whether each predicted tool call matches the expected one; AgentBoard's Progress Rate compares the actual trajectory against the expected trajectory; multi-agent frameworks score collaborative efficiency and how well agents distribute tasks dynamically. Each is an instance of "stop scoring the endpoint, start scoring the path." The trajectory becomes the evidence, and the qualities that only exist over time — recovery, coordination, partial progress — become the things judged.
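Two of the path-level scores named above can be approximated with short functions. The sketch below is a simplified stand-in, not the scoring code of T-Eval or AgentBoard: the position-wise exact match, the subgoal-set representation, and the names `tool_call_match` and `progress_rate` are assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence, Set

ToolCall = Dict[str, Any]  # assumed shape: {"name": str, "args": dict}


def tool_call_match(predicted: List[ToolCall], expected: List[ToolCall]) -> float:
    """Fraction of positions where the predicted call equals the expected one.

    Dividing by the longer sequence penalizes both missing and extra calls.
    """
    if not predicted and not expected:
        return 1.0
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / max(len(predicted), len(expected))


def progress_rate(achieved_per_step: Sequence[Set[str]], subgoals: Set[str]) -> float:
    """Best fraction of subgoals satisfied at any single step of the trajectory."""
    if not subgoals:
        return 1.0
    best = 0.0
    for achieved in achieved_per_step:
        best = max(best, len(achieved & subgoals) / len(subgoals))
    return best


predicted_calls = [
    {"name": "search", "args": {"q": "a"}},
    {"name": "open", "args": {"id": 1}},
    {"name": "submit", "args": {}},
]
expected_calls = [
    {"name": "search", "args": {"q": "a"}},
    {"name": "open", "args": {"id": 2}},
    {"name": "submit", "args": {}},
]
match_score = tool_call_match(predicted_calls, expected_calls)  # 2 of 3 calls match

subgoals = {"login", "find_item", "checkout"}
achieved = [{"login"}, {"login", "find_item"}, {"login"}]
progress = progress_rate(achieved, subgoals)  # peak of 2 of 3 subgoals
```

Both functions give partial credit to a trajectory that never reaches the correct endpoint, which is the property a final-answer score lacks.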

Why it matters: this reframes a scattered set of agent metrics as a coherent move. Once you see the scoring of process, recoverability, and coordination as the trajectory-level analogue of final-answer scoring, you can ask the design-science questions (which artifacts to admit, how to map them to judgments) systematically rather than benchmark by benchmark. The counterpoint: richer evidence is also noisier and harder to standardize, which is why the expansion adds new evaluation challenges instead of dissolving the old ones.

Inquiring lines that read this note 16

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do agents confidently report success despite actually failing tasks?

How do we evaluate AI systems when user perception misleads actual performance?

What evaluation criteria can hold across legitimate adoption and coercion?

What drives capability and cost efficiency in agent systems?

Can single-axis benchmarks accurately predict agent deployment success?

Does externalizing cognitive work and state improve agent reliability?

How should dialogue recommender systems manage conversation history and state?

How does evaluating interaction trajectories change what we measure beyond correctness?

What memory abstraction level best enables agent knowledge reuse?

How do execution traces and tests represent agent environment state?

How can AI agents autonomously learn and transfer skills across tasks?

Can a progressively stricter evaluator act like a curriculum for improving agents?

Related concepts in this collection 3

Should interactive evaluation be designed as a unified paradigm?
As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
the paradigm whose evidence-and-procedure expansion this pattern describes concretely

Can trajectory structure replace hand-annotated process rewards?
Recent methods extract step-level supervision directly from how agent trajectories are structured (trees, expert alignments, tool calls) rather than training separate reward models. Can this structural approach consistently avoid annotation costs?
operationalizes trajectory-as-evidence for training, complementing trajectory-as-evidence for evaluation

Does agent interaction time scale separately from reasoning depth?
Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.
the capability side: interaction-horizon abilities are exactly what trajectory-level evaluation is needed to measure
