SYNTHESIS NOTE
Agentic Systems and Tool Use

Do automated benchmarks hide what frontier AI systems can really do?

Benchmarks optimize for auto-gradable, short, cheap tasks. But real AI capability emerges in long-horizon, messy, open-ended work. How much capability are we missing—or wrongly inflating—by relying on benchmark scores alone?

Synthesis note · 2026-06-03 · sourced from Evaluations

Benchmark-based evaluation underpins public discussion of AI progress, but it has a structural bias: constructing a benchmark requires tasks that are precisely specified, automatically verifiable, relatively easy to optimize for, and runnable on low budgets over short horizons. That selection both overstates capability (optimizable, gradable tasks flatter models) and understates it (real tasks that don't fit the mold go unmeasured). Decisions about funding, regulation, and safety are increasingly based on these measurements.

The proposed complement is open-world evaluation: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. The motivating instance is concrete: an AI agent was tasked with developing and publishing an iOS app to the App Store, and it completed the task with a single manual intervention, which was itself unnecessary. The result suggests that open-world evals can give early warning of capabilities about to become widespread.

Two methodological practices generalize beyond the example and are worth carrying forward. First, invest in log analysis: agent logs contain far more than a binary outcome. They show how the agent decomposes problems, recovers from failure, explores the solution space, and sometimes misrepresents its own progress, none of which is recoverable from aggregate scores. Second, report cost as a first-class quantity: capability scales with budget, so a score without its cost is uninterpretable. This note sits alongside "Does a single benchmark score actually predict agent readiness?" and "Should interactive evaluation be designed as a unified paradigm?" as part of a broader argument that aggregate benchmark numbers are the wrong instrument for frontier agents.
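
As a rough illustration of reporting cost alongside outcome, the sketch below summarizes a handful of agent runs as a success rate per budget level rather than a single score, and tallies labels from manual log review. It is not taken from the source: the RunRecord fields, the flag labels, and every number in the example data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """One agent run. All field names are hypothetical, for illustration only."""
    task_id: str
    success: bool
    cost_usd: float                 # total spend for the run
    manual_interventions: int = 0   # times a human had to step in
    log_flags: list = field(default_factory=list)  # labels from manual log review


def success_rate_within_budget(runs, budget_usd):
    """Fraction of all runs that succeeded while spending at most budget_usd."""
    if not runs:
        return 0.0
    wins = sum(1 for r in runs if r.success and r.cost_usd <= budget_usd)
    return wins / len(runs)


def summarize(runs, budgets):
    """Report outcome together with cost, interventions, and log-review labels."""
    flag_counts = {}
    for r in runs:
        for flag in r.log_flags:
            flag_counts[flag] = flag_counts.get(flag, 0) + 1
    return {
        "n_runs": len(runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "manual_interventions": sum(r.manual_interventions for r in runs),
        "success_rate_by_budget": {
            b: success_rate_within_budget(runs, b) for b in budgets
        },
        "log_flags": flag_counts,
    }


# Made-up example data, not results from any real evaluation.
runs = [
    RunRecord("a", True, 4.0),
    RunRecord("b", True, 12.0, manual_interventions=1,
              log_flags=["recovered_from_failure"]),
    RunRecord("c", False, 20.0, log_flags=["misreported_progress"]),
    RunRecord("d", True, 30.0),
]
report = summarize(runs, budgets=[5.0, 15.0, 50.0])
print(report)
```

With this made-up data the same set of runs yields a different success rate at each budget level, which is the sense in which a score quoted without its cost cannot be interpreted. The log_flags tally is only a container for findings; the labels themselves would still come from reading the logs.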

Original note title

open-world evaluations of messy long-horizon real tasks correct the distortions automated benchmarks introduce