SYNTHESIS NOTE

Should interactive evaluation be designed as a unified paradigm?

As AI systems increasingly interact over time with tools and environments, evaluation practice must evolve. Should interactive evaluation be treated as a principled design science with shared protocols, or adopted incrementally, one new benchmark at a time?

Synthesis note · 2026-05-28 · sourced from Evaluations

AI evaluation is undergoing a structural change: models are increasingly deployed as systems that act over time through tools, environments, users, and other agents. Yet most evaluation practice still inherits response-centered assumptions — fixed inputs, isolated outputs, a judgment made from a single response. Interactive benchmarks have proliferated, but the landscape is fragmented: they disagree on what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This paper's position is that interactive evaluation should be treated as a principled paradigm, not as the next family of agent benchmarks to collect.

The argument turns on a definition: evaluation is an autonomous mapping E: X → Y from admissible evidence X to judgments Y. Interactive evaluation changes both sides. The evidence X expands from final responses to interaction-generated trajectories; the procedure E must assess not just final correctness but process quality, recoverability, coordination, safety, efficiency, and robustness. From this the authors build a two-axis taxonomy (what artifacts enter; how they map to judgments), derive design principles and reporting standards, and locate where current benchmarks concentrate and what they miss.
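The mapping E: X → Y can be made concrete with a minimal sketch. Everything named here (`Step`, `Trajectory`, the judgment keys, and the "recovered" rule) is a hypothetical illustration, not the paper's taxonomy or any benchmark's actual scoring; it only shows how widening the evidence X from a final response to a trajectory lets the procedure E report more than final correctness.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# Y: a judgment is a dict of named scores (an assumption made for illustration).
Judgment = Dict[str, Union[bool, int]]


@dataclass(frozen=True)
class Step:
    action: str  # hypothetical label for what the system did
    ok: bool     # whether that step succeeded


@dataclass(frozen=True)
class Trajectory:
    steps: List[Step]
    final_response: str


def response_evaluator(x: str, reference: str) -> Judgment:
    """Response-centered E: the evidence X is a single final response."""
    return {"correct": x.strip() == reference.strip()}


def trajectory_evaluator(x: Trajectory, reference: str) -> Judgment:
    """Interactive E: the evidence X is the whole interaction trajectory."""
    failures = [i for i, s in enumerate(x.steps) if not s.ok]
    # Illustrative recovery rule: every failed step is followed by at least
    # one later successful step. With no failures this is trivially True.
    recovered = all(any(s.ok for s in x.steps[i + 1:]) for i in failures)
    return {
        "correct": x.final_response.strip() == reference.strip(),
        "num_steps": len(x.steps),
        "num_failures": len(failures),
        "recovered": recovered,
    }


traj = Trajectory(
    steps=[Step("search", True), Step("call_tool", False), Step("call_tool", True)],
    final_response="42",
)
print(response_evaluator(traj.final_response, "42"))
# {'correct': True}
print(trajectory_evaluator(traj, "42"))
# {'correct': True, 'num_steps': 3, 'num_failures': 1, 'recovered': True}
```

Both evaluators agree on final correctness, but only the trajectory evaluator can see the failed tool call and the recovery from it. This is the sense in which the shift is additive: the response-level judgment survives as one field among several.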

Why it matters: the distinction between designing and adopting is the whole point. Adopting interactive benchmarks one at a time produces incomparable, non-reproducible, non-extensible scores — the same fragmentation that plagued early benchmark culture, now at the trajectory level. Treating interactive evaluation as a design science forces explicit protocols, richer trajectory measures, shared infrastructure, and reporting standards that make scores interpretable. The counterpoint the paper concedes: response-centered evaluation remains useful — it is insufficient, not wrong — so the paradigm shift is additive, expanding what counts as evidence rather than discarding the old measures.

Inquiring lines that read this note 6

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does AI assistance affect human cognitive development and reasoning autonomy?

How does the evaluator become part of the definition of intelligence?

Can ensemble evaluation methods reduce bias more than single judges?

How do we evaluate AI systems when user perception misleads actual performance?

Related concepts in this collection 3

How should we evaluate agent behavior beyond final answers? Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
the concrete shift in evidence and scoring that this paradigm formalizes
Do interactive evaluations actually solve the benchmark comparison problem? Interactive, trajectory-based evaluation promises richer evidence than response-only benchmarks. But does moving to this format resolve longstanding challenges like comparability and reproducibility, or do those problems simply reappear at a new scale?
why a new format does not escape old problems, motivating the design-science framing
Will AI automation eventually formalize designer taste? Designers argue taste is the irreducible human element AI cannot replicate. But does the same automation pattern that formalized other skilled work suggest taste itself will become the next layer to be encoded into evaluation systems?
both treat evaluation design itself as the contested, formalizable layer of the work
