Can single-axis benchmarks measure across all three agent capability layers?
This explores whether one benchmark number can capture agent ability when 'capability' is actually layered across distinct dimensions. The corpus says no, and treats that mismatch as the core failure of how agents get measured today.
The corpus is unusually unified in answering no. The starting point is that agent capability isn't a quantity, it's a vector. One analysis decomposes it into at least five separable axes (task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness) and shows that models topping one axis often rank lower on others, which makes any single-score ranking systematically misleading for real deployment (see "Does a single benchmark score actually predict agent readiness?"). If the thing you're measuring has multiple independent directions, a one-number gauge can only ever capture a projection of it, and whatever lies off that axis goes unmeasured.
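A minimal sketch of the vector-versus-scalar point, assuming three made-up agents scored on the five axes named above. The agent names and every number here are hypothetical and not drawn from any benchmark; the only aim is to show that the leader on one axis need not lead on the others or on an averaged score.

```python
# Hypothetical per-axis scores in [0, 1]; names and numbers are invented
# for illustration and are not taken from any real evaluation.
AXES = [
    "task_success",
    "privacy_compliance",
    "long_horizon_retention",
    "mode_shift_behavior",
    "ecosystem_readiness",
]

SCORES = {
    "agent_a": [0.92, 0.40, 0.55, 0.60, 0.50],
    "agent_b": [0.70, 0.85, 0.80, 0.65, 0.75],
    "agent_c": [0.80, 0.60, 0.45, 0.90, 0.55],
}


def rank_by_axis(scores, axis_index):
    """Return agent names sorted best-first on a single axis."""
    return sorted(scores, key=lambda name: scores[name][axis_index], reverse=True)


def rank_by_mean(scores):
    """Return agent names sorted best-first on the unweighted mean of all axes."""
    return sorted(
        scores,
        key=lambda name: sum(scores[name]) / len(scores[name]),
        reverse=True,
    )


per_axis_rankings = {axis: rank_by_axis(SCORES, i) for i, axis in enumerate(AXES)}
mean_ranking = rank_by_mean(SCORES)

for axis, ranking in per_axis_rankings.items():
    print(f"{axis:>24}: {ranking}")
print(f"{'unweighted mean':>24}: {mean_ranking}")
```

With these invented numbers, the agent that tops task success comes last on privacy compliance and last on the unweighted mean. Even the mean is itself a single-axis projection: a different weighting would produce a different ordering.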
Why 'layers' specifically? Reliable agents tend to externalize three distinct cognitive burdens, memory (keeping state), skills (reusable procedures), and protocols (structured interaction), into a harness around the model rather than expecting raw model scale to carry all three (see "Where does agent reliability actually come from?"). A single-axis benchmark rarely exercises more than one of these, and usually does so in a clean one-shot setting. That's why the corpus argues evaluation has to move past one-shot task accuracy toward trajectory quality, memory hygiene, context efficiency, and verification cost: two agents with identical success rates can differ enormously in what they consume and how reliably they get there (see "Should agent evaluation measure more than task success?").
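The "identical success rates, very different costs" claim can be sketched as follows. This is an illustrative assumption, not a metric definition from the corpus: the `Run` fields (`tokens_used`, `verifier_calls`) and all run data are hypothetical stand-ins for context efficiency and verification cost.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    success: bool
    tokens_used: int     # hypothetical proxy for context efficiency
    verifier_calls: int  # hypothetical proxy for verification cost


def summarize(runs):
    """Report success rate alongside resource-oriented metrics for a set of runs."""
    successes = sum(r.success for r in runs)
    total_tokens = sum(r.tokens_used for r in runs)
    return {
        "success_rate": successes / len(runs),
        "mean_tokens": mean(r.tokens_used for r in runs),
        "mean_verifier_calls": mean(r.verifier_calls for r in runs),
        # Cost of each successful outcome; infinite if nothing succeeded.
        "tokens_per_success": total_tokens / successes if successes else float("inf"),
    }


# Invented run logs for two agents with the same 3-out-of-4 success rate.
lean_runs = [Run(True, 2000, 1), Run(True, 2500, 1), Run(False, 3000, 2), Run(True, 2500, 1)]
heavy_runs = [Run(True, 20000, 4), Run(False, 30000, 6), Run(True, 25000, 5), Run(True, 25000, 5)]

lean_summary = summarize(lean_runs)
heavy_summary = summarize(heavy_runs)

print("lean :", lean_summary)
print("heavy:", heavy_summary)
```

A leaderboard that reports only `success_rate` would treat these two agents as equal, although in this made-up data one consumes ten times the tokens and several times the verification effort per run.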
The most concrete evidence that single-axis scores miss the other layers is the gap between contest performance and real work. Studying 960 real occupational workflows, researchers found that agents which ace abstract benchmark contests fail at long-horizon professional tasks. Crucially, the failure lies in benchmark design, not model capability: the field optimizes what it measures, and it has been measuring contests rather than work (see "Why do agent benchmarks not predict real economic value?"). The retention/persistence layer is a good example of something single scores erase. Across 17 frontier models, the dominant predictor of ultra-long-horizon success wasn't initial answer quality but persistence through repeated feedback loops, a property invisible to any one-shot metric (see "What predicts success in ultra-long-horizon agent tasks?").
And there's a coordination layer single benchmarks can't see at all. Multi-agent systems degrade predictably as networks scale: agents agree too late, adopt strategies without telling neighbors, or accept information without verifying it. A per-agent capability score therefore tells you nothing about whether a group of them will function together (see "Why do multi-agent systems fail to coordinate at scale?"). Worse, much of what a single multi-agent number captures is just spending: roughly 80% of multi-agent performance variance traces to token budget rather than coordination intelligence, so a higher score can simply mean a bigger bill (see "How does test-time scaling work at the agent level?").
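One common way to quantify "how much of the variance traces to token budget" is the coefficient of determination (R squared) from a simple linear fit of score against budget. The sketch below assumes that reading; the corpus does not say how the roughly 80% figure was computed, and the data points are invented for illustration.

```python
def r_squared(x, y):
    """Fraction of variance in y explained by a least-squares line on x.

    For simple linear regression this equals the squared Pearson correlation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy ** 2) / (sxx * syy)


# Hypothetical multi-agent runs: token budget (in thousands) and benchmark score.
token_budget_k = [10, 20, 30, 40, 50]
benchmark_score = [0.30, 0.42, 0.47, 0.61, 0.65]

explained = r_squared(token_budget_k, benchmark_score)
print(f"share of score variance explained by token budget: {explained:.2f}")
```

When this share is high, comparing systems at a fixed token budget, or reporting score per token, is a fairer test of coordination than the raw score alone.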
The quietly surprising part is that capability isn't even monotonic across tiers. The capacity to produce useful harness updates is flat across model sizes, but the capacity to actually benefit from them follows an inverted U that peaks at mid-tier models: weak models never invoke the harness, and strong ones over-improvise instead of following it (see "Do stronger models always evolve harnesses better?"). That alone breaks the assumption a single axis depends on, namely that 'more capable' moves in one direction. The corpus's collective answer is that you can't compress layered, sometimes non-monotonic, partly system-level emergent capability into one number, and pretending you can is precisely how a model clears every contest and still can't do the job.
Sources (8 notes)
Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
One-shot task accuracy hides critical system behavior across trajectory quality, memory hygiene, context efficiency, and verification cost. Multi-dimensional measurement is harder to optimize but essential because identical success rates mask enormous differences in resource consumption and reliability.
ALE's analysis of 960 real occupational workflows shows agents excel at abstract contests but fail long-horizon professional tasks. The gap is not model capability but benchmark design—the field optimizes what it measures, and it has measured contests rather than work.
Across 17 frontier models on 36 expert-curated optimization tasks, repeated benchmark-edit-incorporate cycles within a wall-clock budget proved the dominant success predictor. Most models terminated early or burned budget unproductively; Claude Opus 4.6 stood out as persistent.
AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
Model capability to produce useful harness edits stays constant across tiers, but capacity to actually benefit from those edits follows an inverted U-shape, peaking in mid-tier models. Weak models fail to invoke harnesses; strong models struggle with faithful instruction-following.