Do interactive evaluations actually solve the benchmark comparison problem?
Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?
The seductive promise of interactive evaluation is that richer evidence solves the problems of response-centered benchmarks. The position paper resists this. Its analysis shows that longstanding evaluation challenges (comparability, reproducibility, the validity of the evidence-to-judgment mapping, and the question of what claims a score actually supports) reappear at the trajectory level rather than disappearing. Scoring a path instead of an endpoint does not escape the core difficulty of evaluation; it relocates that difficulty into a higher-dimensional space where it is, if anything, harder to pin down.
This is why the paper frames the situation as a question demanding design, not a solution already in hand. A trajectory admits many scoring choices, and different interactive benchmarks make incompatible ones, so their results are not interchangeable. This is the same fragmentation that response benchmarks eventually had to standardize away, now recurring with more degrees of freedom. Process quality, recoverability, and coordination are genuinely informative, but each introduces its own version of the old questions: what counts as evidence, how is it aggregated into a judgment, and what does the resulting number license you to claim?
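The point about incompatible scoring choices can be made concrete with a minimal sketch. Everything in it is hypothetical and chosen for illustration: the `Trajectory` record, the `endpoint_score` and `process_score` functions, and the two agents are assumptions, not formats or metrics taken from the paper or from any specific benchmark.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record of one agent run (not a format from the paper)."""
    step_ok: Tuple[bool, ...]  # per-step success flags, in order
    task_solved: bool          # whether the final outcome was correct


def endpoint_score(t: Trajectory) -> float:
    """Response-style scoring: only the final outcome counts."""
    return 1.0 if t.task_solved else 0.0


def process_score(t: Trajectory) -> float:
    """Process-style scoring: fraction of steps that succeeded."""
    if not t.step_ok:
        return 0.0
    return sum(t.step_ok) / len(t.step_ok)


def rank(agents: Dict[str, Trajectory],
         scorer: Callable[[Trajectory], float]) -> List[str]:
    """Return agent names ordered from best to worst under `scorer`."""
    return sorted(agents, key=lambda name: scorer(agents[name]), reverse=True)


# Illustrative data only: one agent stumbles but finishes, the other
# proceeds cleanly and fails at the last step.
AGENTS = {
    "agent_a": Trajectory((False, False, True, False, True), task_solved=True),
    "agent_b": Trajectory((True, True, True, True, False), task_solved=False),
}

if __name__ == "__main__":
    print("endpoint ranking:", rank(AGENTS, endpoint_score))
    print("process ranking: ", rank(AGENTS, process_score))
```

Under these assumptions the two scorers order the same pair of agents in opposite ways, so a leaderboard built on one scoring choice cannot be read as evidence about the other.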
Why it stays open: the honest reading is that interactive evaluation buys richer evidence at the cost of reintroducing every hard problem at a new scale. The field's task is therefore not to adopt the format but to build the protocols, robustness tests, shared infrastructure, and reporting standards that make trajectory scores interpretable, and that work is unfinished. Treating the new paradigm as a fix would repeat the fragmentation of response-centered benchmarking; treating it as a design problem is the corrective the paper argues for.
Inquiring lines that use this note as a source (23)
This note is a source for these synthesized inquiries.
- Why do benchmark designers treat content effects as confounds?
- What makes trajectory more actionable than absolute scores for human moderators?
- How do surface correlations between narratives and answers mislead benchmark validity?
- Should benchmark evaluations use multiple prompt formulations for difficult tasks?
- Should benchmarks measure trace length or whether constraints were actually satisfied?
- What deployment context determines which benchmark mode actually matters?
- Why do current benchmarks fail to match user satisfaction with search results?
- What is the gap between benchmark performance and real workplace task completion?
- How do trajectory quality and memory hygiene differ as evaluation metrics?
- Does longer interaction horizon require fundamentally different evaluation approaches?
- Why do short interaction benchmarks fail to predict long horizon performance?
- What reporting standards would make interactive evaluation scores comparable across benchmarks?
- How can interactive evaluation avoid replicating fragmentation problems from response-centered benchmark culture?
- What makes a trajectory score interpretable across different interactive benchmarks?
- Why does enlarging the evaluation unit reintroduce comparability problems?
- Should long horizon performance be measured as a separate evaluation axis?
- What evaluation structure would capture deployment readiness instead of benchmark scores?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- Can a single axis benchmark ever represent deployment readiness accurately?
- How do static benchmarks fail to capture human preference alignment?
- How do live human evaluations differ from ground-truth benchmarks?
- What real-world tasks most clearly expose gaps between benchmark performance and actual capability?
- Can open-world evaluations become a scalable paradigm without becoming the next benchmark trap?
Related concepts in this collection (4)
- Should interactive evaluation be designed as a unified paradigm?
  As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally as new benchmarks?
  the reappearance of old challenges is the central motivation for designing rather than adopting
- How should we evaluate agent behavior beyond final answers?
  Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
  the evidence expansion that creates the higher-dimensional space where old problems recur
- Should we evaluate deployed agents as whole environments instead?
  Conventional LLM evaluation focuses on models or individual episodes, but what if the right measurement unit is the entire coupled human-agent system including memory, tools, and protocols observed over time?
  extends: enlarging the unit of evaluation is precisely what reintroduces comparability and reproducibility problems at the new scale
- What should we actually measure in agent evaluation?
  Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
  exemplifies the new dimensions (memory hygiene, verification cost) that each carry their own version of the old evidence-to-judgment questions
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Interactive Evaluation Requires a Design Science
- UserBench: An Interactive Gym Environment for User-Centric Agents
- Evaluation and Benchmarking of LLM Agents: A Survey
- Deep Research: A Systematic Survey
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- The Missing Layer of AGI: From Pattern Alchemy to Coordination Physics
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications
- Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Original note title
longstanding evaluation challenges reappear at the trajectory level rather than disappearing