SYNTHESIS NOTE

Should agent evaluation measure more than task success?

Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?

Synthesis note · 2026-05-28 · sourced from Agent Harness

Agent evaluation has inherited the model-centric habit of reducing performance to a single number: final-task success or benchmark accuracy. The "system scaling" framing argues that this habit is increasingly inadequate, because agent behavior emerges from the interaction of the foundation model with a memory substrate, a context constructor, a skill-routing layer, an orchestration loop, and a verification-and-governance layer. A one-shot success score collapses all of this into a binary that hides how the agent got there. Two agents with identical task-success rates can differ enormously in how much they spent, how much context they wasted, how clean their memory stayed, and how reliably they verified their own actions.

The proposed alternative is a research agenda for harness-level benchmarks that measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. The point is that the same model "projected onto different harnesses produce qualitatively different agents", so evaluation must measure the system, not just the model. The counterpoint is that multi-dimensional metrics are harder to optimize and compare, and task success remains the outcome users ultimately care about. But success-only scores create false confidence in deployment readiness. The practical consequence for builders is guidance on what to instrument: the process, not only the outcome.
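As a rough illustration of what instrumenting the process could look like, the sketch below records a few process-level fields per run and aggregates them next to the success rate. Everything in it is an assumption made for illustration: the `RunRecord` fields, the `summarize` helper, the metric definitions, and the numbers are hypothetical and do not come from the note or from any existing benchmark. It covers only cost, context efficiency, and verification; memory hygiene, communication fidelity, and safe evolution are left out because the note does not define how to measure them.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RunRecord:
    """One agent run. All field names are hypothetical, chosen for illustration."""
    succeeded: bool        # final-task success, the usual single score
    cost_usd: float        # total spend for the run
    context_tokens: int    # tokens placed in the context window
    useful_tokens: int     # tokens the run actually relied on (assumed measurable)
    actions: int           # actions taken during the run
    verified_actions: int  # actions the agent checked itself


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate process-level metrics alongside the success rate.

    Assumes a non-empty list with at least one context token and one action.
    """
    total_context = sum(r.context_tokens for r in runs)
    total_actions = sum(r.actions for r in runs)
    return {
        "success_rate": mean(1.0 if r.succeeded else 0.0 for r in runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "context_efficiency": sum(r.useful_tokens for r in runs) / total_context,
        "verification_rate": sum(r.verified_actions for r in runs) / total_actions,
    }


# Two made-up agents with identical success rates but different process profiles.
agent_a = [
    RunRecord(True, 0.10, 4_000, 3_000, 10, 8),
    RunRecord(False, 0.10, 4_000, 3_000, 10, 8),
]
agent_b = [
    RunRecord(True, 0.50, 20_000, 4_000, 10, 2),
    RunRecord(False, 0.50, 20_000, 4_000, 10, 2),
]

summary_a = summarize(agent_a)
summary_b = summarize(agent_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```

On a success-only leaderboard the two agents in this toy data would tie, while the other three columns separate them. A real harness-level benchmark would still need to settle how fields such as `useful_tokens` are actually measured.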

Inquiring lines that read this note 56

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What dimensions of recommendation quality do standard metrics miss?

Can standard accuracy metrics miss the real constraints on user consumption?

What drives capability and cost efficiency in agent systems?

Why do agents confidently report success despite actually failing tasks?

Can single-axis benchmarks accurately predict agent deployment success?

When should tasks involve human-AI partnership versus full automation?

What task characteristics determine whether humans or agents should handle work?

Why do persona-level simulations fail to predict individual preferences accurately?

Why do current evaluation metrics fail to catch reasoning failures in persona agents?

What coordination failures limit multi-agent LLM systems as they scale?

How do single-agent safety evaluations underestimate risks in deployed multi-agent systems?

Why do benchmark improvements fail to reflect actual reasoning quality?

What is the gap between benchmark performance and real workplace task completion?

How should agents balance memory condensation to optimize context efficiency?

Can ensemble evaluation methods reduce bias more than single judges?

Does externalizing cognitive work and state improve agent reliability?

How should conversational agents balance goal-driven initiative with user control?

What makes idle window detection valuable for continuous agent improvement?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Should agents use parallel or sequential scaling during test time?

How do we evaluate AI systems when user perception misleads actual performance?

Do harness improvements transfer across model scales or memorize shortcuts?

What components of agent scaffolding most impact domain-specific output quality?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

Why do static benchmarks miss frontier capabilities that open-world tasks reveal?

Related concepts in this collection 4

Does a single benchmark score actually predict agent readiness? Single-axis benchmarks rank models by one capability—like task success—but ignore privacy, duration, operating mode, and ecosystem fit. Can one number really capture what matters for deployment?
directly supports moving from one number to multiple axes of agent evaluation
Does agent efficiency really break down into three distinct components? Can we understand agent efficiency as three independent optimization problems—memory, tool use, and planning—each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.
operationalizes the context-efficiency and verification-cost dimensions this note calls for
Do phone agents succeed at all three critical tasks equally? Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.
concrete evidence that success-only scores overstate readiness
How should we evaluate agent behavior beyond final answers? Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
synthesizes: the same shift framed as a design-science move — this note lists the harness dimensions to instrument, that note formalizes the evidence expansion (final response → trajectory) underlying them
