SYNTHESIS NOTE

Do interactive evaluations actually solve the benchmark comparison problem?

Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?

Synthesis note · 2026-05-28 · sourced from Evaluations

The seductive promise of interactive evaluation is that richer evidence solves the problems of response-centered benchmarks. The position paper resists this. Its analysis shows that longstanding evaluation challenges (comparability, reproducibility, the validity of the evidence-to-judgment mapping, and the question of what claims a score actually supports) reappear at the trajectory level rather than disappearing. Scoring a path instead of an endpoint does not escape the core difficulty of evaluation; it relocates that difficulty into a higher-dimensional space where it is, if anything, harder to pin down.

This is why the paper frames the situation as a question demanding design, not a solution already in hand. A trajectory admits many scoring choices, and different interactive benchmarks make incompatible ones, so their results are not interchangeable. This is the same fragmentation that response benchmarks eventually had to standardize away, now recurring with more degrees of freedom. Process quality, recoverability, and coordination are genuinely informative, but each introduces its own version of the old questions: what counts as evidence, how is it aggregated into a judgment, and what does the resulting number license you to claim?
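The point about incompatible scoring choices can be made concrete with a small sketch. It is not taken from the paper: the `Step` and `Trajectory` types, the two scoring rules, and the two example runs are all hypothetical, chosen only to show how the same pair of trajectories can be ranked in opposite order depending on what the scorer counts as evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    ok: bool  # hypothetical flag: did this step complete without an error?


@dataclass
class Trajectory:
    steps: List[Step]
    task_solved: bool  # hypothetical flag: was the end goal reached?


def endpoint_score(t: Trajectory) -> float:
    """Score only the outcome, as a response-centered benchmark would."""
    return 1.0 if t.task_solved else 0.0


def process_score(t: Trajectory) -> float:
    """Score the path: fraction of steps that completed without an error."""
    return sum(s.ok for s in t.steps) / len(t.steps)


def rank(runs: Dict[str, Trajectory], scorer: Callable[[Trajectory], float]) -> List[str]:
    """Return agent names ordered best-first under the given scorer."""
    return sorted(runs, key=lambda name: scorer(runs[name]), reverse=True)


# Illustrative data only: agent_a stumbles early, recovers, and solves the task;
# agent_b proceeds cleanly but fails at the last step.
runs = {
    "agent_a": Trajectory([Step(False), Step(False), Step(True), Step(True)], task_solved=True),
    "agent_b": Trajectory([Step(True), Step(True), Step(True), Step(False)], task_solved=False),
}

endpoint_ranking = rank(runs, endpoint_score)
process_ranking = rank(runs, process_score)

print("endpoint:", endpoint_ranking)
print("process: ", process_ranking)
```

Under these assumptions the outcome-only scorer prefers agent_a and the path-based scorer prefers agent_b. Neither ranking is wrong; each answers a different question, which is why scores produced under different trajectory scoring rules cannot be compared directly.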

Why it stays open: the honest reading is that interactive evaluation buys richer evidence at the cost of reintroducing every hard problem at a new scale. The field's task is therefore not merely to adopt the format but to build the protocols, robustness tests, shared infrastructure, and reporting standards that make trajectory scores interpretable, and that work is unfinished. Treating the new paradigm as a fix would repeat the mistake of assuming that richer evidence resolves these problems by itself; treating it as a design problem is the corrective the paper argues for.

Inquiring lines that read this note 25

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can identical external performance mask different internal representations?

Can ensemble evaluation methods reduce bias more than single judges?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can single-axis benchmarks accurately predict agent deployment success?

What dimensions of recommendation quality do standard metrics miss?

Why do current benchmarks fail to match user satisfaction with search results?

What memory abstraction level best enables agent knowledge reuse?

How do trajectory quality and memory hygiene differ as evaluation metrics?

How do we evaluate AI systems when user perception misleads actual performance?

Related concepts in this collection 4

Should interactive evaluation be designed as a unified paradigm?
As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
the reappearance of old challenges is the central motivation for designing rather than adopting

How should we evaluate agent behavior beyond final answers?
Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
the evidence expansion that creates the higher-dimensional space where old problems recur

Should we evaluate deployed agents as whole environments instead?
Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?
extends: enlarging the unit of evaluation is precisely what reintroduces comparability and reproducibility problems at the new scale

Should agent evaluation measure more than task success?
Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?
exemplifies the new dimensions (memory hygiene, verification cost) that each carry their own version of the old evidence-to-judgment questions
