Should interactive evaluation be designed as a unified paradigm?
As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally, one new benchmark at a time?
AI evaluation is undergoing a structural change: models are increasingly deployed as systems that act over time through tools, environments, users, and other agents. Yet most evaluation practice still inherits response-centered assumptions — fixed inputs, isolated outputs, a judgment made from a single response. Interactive benchmarks have proliferated, but the landscape is fragmented: they disagree on what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This paper's position is that interactive evaluation should be treated as a principled paradigm, not as the next family of agent benchmarks to collect.
The argument turns on a definition: evaluation is an autonomous mapping E: X → Y from admissible evidence X to judgments Y. Interactive evaluation changes both sides. The evidence X expands from final responses to interaction-generated trajectories; the procedure E must assess not just final correctness but process quality, recoverability, coordination, safety, efficiency, and robustness. From this the authors build a two-axis taxonomy (what artifacts enter; how they map to judgments), derive design principles and reporting standards, and locate where current benchmarks concentrate and what they miss.
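A minimal sketch of this definition may help make the two sides concrete. It assumes hypothetical names throughout (`Step`, `Trajectory`, `response_centered`, `interactive`, and the judgment keys); none of them come from the paper, and the recovery rule is only one illustrative way to score a trajectory. Both functions are instances of E: the first admits only the final response as evidence, while the second admits the whole trajectory and returns a richer judgment.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical evidence types: these names are illustrative, not from the paper.
@dataclass(frozen=True)
class Step:
    action: str  # label for what the system did at this step
    ok: bool     # whether the step succeeded

@dataclass(frozen=True)
class Trajectory:
    steps: List[Step]   # interaction-generated evidence
    final_answer: str   # the final response

# A judgment Y is modeled here as a dictionary of named scores.
Judgment = Dict[str, float]

def response_centered(traj: Trajectory, gold: str) -> Judgment:
    """E over response-only evidence: inspects just the final answer."""
    return {"final_correct": float(traj.final_answer == gold)}

def interactive(traj: Trajectory, gold: str) -> Judgment:
    """E over trajectory evidence: keeps the old measure and adds process measures."""
    judgment = response_centered(traj, gold)
    failed = [i for i, step in enumerate(traj.steps) if not step.ok]
    # Assumed recovery rule: every failed step is followed by a later successful one.
    # A trajectory with no failed steps counts as recovered.
    recovered = all(any(s.ok for s in traj.steps[i + 1:]) for i in failed)
    judgment["recovered_from_errors"] = float(recovered)
    judgment["steps_used"] = float(len(traj.steps))  # crude efficiency proxy
    return judgment

if __name__ == "__main__":
    traj = Trajectory(
        steps=[Step("search", True), Step("call_tool", False), Step("call_tool", True)],
        final_answer="42",
    )
    print(response_centered(traj, "42"))
    print(interactive(traj, "42"))
```

Because `interactive` starts from the output of `response_centered`, the sketch also mirrors the additive framing: the trajectory-level judgment extends the response-level one rather than replacing it.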
Why it matters: the distinction between designing and adopting is the whole point. Adopting interactive benchmarks one at a time produces incomparable, non-reproducible, non-extensible scores — the same fragmentation that plagued early benchmark culture, now at the trajectory level. Treating interactive evaluation as a design science forces explicit protocols, richer trajectory measures, shared infrastructure, and reporting standards that make scores interpretable. The counterpoint the paper concedes: response-centered evaluation remains useful — it is insufficient, not wrong — so the paradigm shift is additive, expanding what counts as evidence rather than discarding the old measures.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries.
- How does the evaluator become part of the definition of intelligence?
- Can contextual design decisions resist formalization into evaluation rubrics?
- Does longer interaction horizon require fundamentally different evaluation approaches?
- What reporting standards would make interactive evaluation scores comparable across benchmarks?
- Why do evaluation design choices themselves become reified into the AI systems being evaluated?
- How might automated evals eventually capture the human judgment designers exercise now?
Related concepts in this collection (3)
- How should we evaluate agent behavior beyond final answers?
  Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
  the concrete shift in evidence and scoring that this paradigm formalizes
- Do interactive evaluations actually solve the benchmark comparison problem?
  Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?
  why a new format does not escape old problems, motivating the design-science framing
- Will AI automation eventually formalize designer taste?
  Designers argue taste is the irreducible human element AI cannot replicate. But does the same automation pattern that formalized other skilled work suggest taste itself will become the next layer to be encoded into evaluation systems?
  both treat evaluation design itself as the contested, formalizable layer of the work
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Interactive Evaluation Requires a Design Science
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- Evaluation and Benchmarking of LLM Agents: A Survey
- Position: Towards Bidirectional Human-AI Alignment
- UserBench: An Interactive Gym Environment for User-Centric Agents
- Agent-as-a-Judge: Evaluate Agents with Agents
- The Method of Critical AI Studies, A Propaedeutic
- Open-World Evaluations for Measuring Frontier AI Capabilities
Original note title
interactive evaluation must be designed as a paradigm not adopted as the next benchmark format