What should we actually measure in agent evaluation?
Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
Agent evaluation has inherited the model-centric habit of reducing performance to a single number: final-task success or benchmark accuracy. The "system scaling" framing argues that this habit is increasingly inadequate, because agent behavior emerges from the interaction of the foundation model with a memory substrate, a context constructor, a skill-routing layer, an orchestration loop, and a verification-and-governance layer. A one-shot success score collapses all of this into a binary that hides how the agent got there. Two agents with identical task-success rates can differ enormously in how much they spent, how much context they wasted, how clean their memory stayed, and how reliably they verified their own actions.
The proposed alternative is a research agenda for harness-level benchmarks that measure trajectory quality, memory hygiene, context efficiency, communication fidelity, verification cost, and safe evolution over time. The point is that the same model "projected onto different harnesses produce qualitatively different agents", so evaluation must measure the system, not just the model. The counterpoint is that multi-dimensional metrics are harder to optimize and compare, and task success remains the outcome users ultimately care about. But success-only scores create false confidence in deployment readiness, because they cannot distinguish an agent that succeeded cleanly from one that succeeded wastefully or without verifying its own actions. The practical upshot is guidance on what builders should instrument: the process, not only the outcome.
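To make the idea of instrumenting the process concrete, here is a minimal sketch in Python of a harness-level report computed from logged episodes. Everything in it is an assumption made for illustration: the `EpisodeTrace` fields, the `harness_report` function, and the specific ratios are hypothetical proxies for four of the dimensions named above, not metrics defined by the source, and the numbers are invented to show the shape of the comparison.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EpisodeTrace:
    """One logged agent run. All field names are hypothetical."""
    success: bool
    steps: int               # actions taken in the trajectory
    redundant_steps: int     # actions that repeated earlier work
    context_tokens: int      # tokens placed in the context window
    cited_tokens: int        # context tokens the agent actually relied on
    memory_entries: int      # entries held in memory at episode end
    stale_entries: int       # entries that were outdated or contradictory
    verification_calls: int  # checks the agent ran on its own actions


def ratio(numerator, denominator):
    """Safe division: returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def harness_report(traces):
    """Summarize a list of traces along several axes, not just success."""
    if not traces:
        raise ValueError("need at least one trace")
    return {
        "task_success": mean(1.0 if t.success else 0.0 for t in traces),
        # Trajectory quality proxy: share of steps that were wasted.
        "trajectory_waste": ratio(sum(t.redundant_steps for t in traces),
                                  sum(t.steps for t in traces)),
        # Context efficiency proxy: share of context that was used.
        "context_efficiency": ratio(sum(t.cited_tokens for t in traces),
                                    sum(t.context_tokens for t in traces)),
        # Memory hygiene proxy: share of memory that went stale.
        "memory_staleness": ratio(sum(t.stale_entries for t in traces),
                                  sum(t.memory_entries for t in traces)),
        # Verification cost proxy: self-checks per episode.
        "verification_calls_per_episode": mean(t.verification_calls
                                               for t in traces),
    }


# Two invented agents with the same success rate but different processes.
agent_a = [
    EpisodeTrace(True, 10, 1, 4000, 3000, 20, 1, 2),
    EpisodeTrace(False, 10, 1, 4000, 3000, 20, 1, 2),
]
agent_b = [
    EpisodeTrace(True, 40, 20, 16000, 4000, 20, 10, 0),
    EpisodeTrace(False, 40, 20, 16000, 4000, 20, 10, 0),
]

for name, traces in (("agent_a", agent_a), ("agent_b", agent_b)):
    print(name, harness_report(traces))
```

Both invented agents report a `task_success` of 0.5, yet `agent_b` wastes half of its steps, relies on a quarter of its context, lets half of its memory go stale, and never verifies its own actions. A success-only score would rank the two as equal. Keeping the axes as separate dictionary entries rather than folding them into one weighted score is a deliberate choice in this sketch, since collapsing them would reintroduce the single-number problem.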
Inquiring lines that use this note as a source (51)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can standard accuracy metrics miss the real constraints on user consumption?
- When should you optimize agent behavior versus tool performance separately?
- Why do agents report success when they have actually failed at tasks?
- How does the execution layer constrain agent performance in tool use?
- How should domain-specific AI be evaluated differently from general benchmarks?
- How much does agent performance depend on demonstration quantity versus curation quality?
- What task characteristics determine whether humans or agents should handle work?
- Why do 85 percent of production agents avoid third-party frameworks?
- Can automated evaluation replace human judgment in agent testing?
- What tasks do AI agents still fail at most often?
- Why do current evaluation metrics fail to catch reasoning failures in persona agents?
- How do single-agent safety evaluations underestimate risks in deployed multi-agent systems?
- Why do benchmark scores not capture the true nature of AI systems?
- Why do completion-mode strengths not transfer to agentic settings?
- How do mode-specific failures differ between completion and agent benchmarks?
- Should agent capability be optimized separately from general capability?
- What deployment context determines which benchmark mode actually matters?
- Why do agents report success when actions actually fail?
- What is the gap between benchmark performance and real workplace task completion?
- Which ecosystem conditions matter most for agent deployment success?
- How should we measure context efficiency and verification cost in agents?
- What makes some agent benchmarks measure interaction quality better than others?
- How do evaluation methods differ for single versus multi-agent systems?
- How should benchmarks measure agent efficiency across all three cost dimensions?
- Where does agent reliability come from if not better tools?
- Can single benchmarks predict whether an agent will work in the real world?
- What role does runtime feedback play in agent verification and progress confirmation?
- What makes idle window detection valuable for continuous agent improvement?
- What metrics replace throughput per token for agent deployment?
- What reporting standards would make interactive evaluation scores comparable across benchmarks?
- Should artifact-level benchmarks replace token counts for agent evaluation?
- Should production agents execute one tool or multiple tools per invocation?
- Can high benchmark scores mislead deployment decisions for search agents?
- Should long horizon performance be measured as a separate evaluation axis?
- What evaluation structure would capture deployment readiness instead of benchmark scores?
- Does single-capability ranking guarantee agent failure in production deployment?
- How do agent privacy compliance and task success differ in evaluation?
- What governance and safety measurements matter for deployed agent environments?
- How do perception and execution gaps limit current AI agent performance?
- Why do estimates for task-level performance differ so much from full job automation timelines?
- Should agents use parallel or sequential scaling during test time?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- Can a single axis benchmark ever represent deployment readiness accurately?
- What components of agent scaffolding most impact domain-specific output quality?
- Should optimal context budgets scale with agent competence or task complexity?
- How do live human evaluations differ from ground-truth benchmarks?
- Why do static benchmarks miss frontier capabilities that open-world tasks reveal?
- Which agent architectures consistently outperform base models on hard prediction questions?
- What do hidden signals in agent logs reveal about frontier capability beyond pass-fail outcomes?
- What real-world tasks most clearly expose gaps between benchmark performance and actual capability?
- Can open-world evaluations become a scalable paradigm without becoming the next benchmark trap?
Related concepts in this collection (4)
- Does a single benchmark score actually predict agent readiness?
  Single-axis benchmarks rank models by one capability, such as task success, but ignore privacy, duration, operating mode, and ecosystem fit. Can one number really capture what matters for deployment?
  Relation: directly supports moving from one number to multiple axes of agent evaluation
- Does agent efficiency really break down into three distinct components?
  Can we understand agent efficiency as three independent optimization problems (memory, tool use, and planning), each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.
  Relation: operationalizes the context-efficiency and verification-cost dimensions this note calls for
- Do phone agents succeed at all three critical tasks equally?
  Explores whether task success, privacy compliance, and preference reuse develop together in phone-use agents, or whether benchmarking one capability tells you nothing about the others.
  Relation: concrete evidence that success-only scores overstate readiness
- How should we evaluate agent behavior beyond final answers?
  Current agent evaluation focuses on endpoint correctness, but agentic systems unfold over time through interaction trajectories. What evidence and scoring methods should we use to capture process quality, recovery, and coordination?
  Relation: synthesizes the same shift framed as a design-science move; this note lists the harness dimensions to instrument, while that note formalizes the evidence expansion (final response → trajectory) underlying them
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Survey on Evaluation of LLM-based Agents
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- Towards a Science of Scaling Agent Systems
- LLMs Corrupt Your Documents When You Delegate
- From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
- Open-World Evaluations for Measuring Frontier AI Capabilities
- Evaluation and Benchmarking of LLM Agents: A Survey
- Artifacts as Memory Beyond the Agent Boundary
Original note title
agent evaluation must move beyond one-shot task success to trajectory quality memory hygiene context efficiency and verification cost