INQUIRING LINE

What capability dimensions does a single aggregate pass rate hide?

This explores what a single overall score (the percentage of tasks an AI gets right) flattens out — the separate, often conflicting capabilities that hide underneath one number.


The corpus is unusually direct that one number is the wrong unit of measurement. The cleanest statement is that capability is not a scalar but a vector: it decomposes into at least five separable axes (task success, privacy compliance, long-horizon retention, behavior under mode shifts, and ecosystem readiness), and models that top one axis routinely rank lower on another (see "Does a single benchmark score actually predict agent readiness?"). A single aggregate collapses that vector to its first component and quietly drops the rest, which is why one score is "systematically misleading" about real deployment.
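To make the vector-versus-scalar point concrete, here is a minimal sketch. The axis names come from the text; the two models and every score are invented for illustration and are not results from any benchmark.

```python
# Hypothetical per-axis scores in [0, 1]. Axis names follow the text;
# the models and numbers are made up purely to illustrate the idea.
AXES = [
    "task_success",
    "privacy_compliance",
    "long_horizon_retention",
    "mode_shift_behavior",
    "ecosystem_readiness",
]

scores = {
    "model_a": [0.90, 0.40, 0.55, 0.60, 0.50],
    "model_b": [0.70, 0.85, 0.75, 0.65, 0.80],
}


def leader_on_axis(all_scores, axis_index):
    """Return the model with the highest score on a single axis."""
    return max(all_scores, key=lambda model: all_scores[model][axis_index])


def scalar_view(all_scores):
    """Collapse each capability vector to its first component (task success)."""
    return {model: vector[0] for model, vector in all_scores.items()}


# Which model leads on each axis when the full vector is kept.
per_axis_leader = {axis: leader_on_axis(scores, i) for i, axis in enumerate(AXES)}

# Which model leads when only the pass rate survives.
scalar_leader = max(scalar_view(scores), key=scalar_view(scores).get)
```

In this toy data `scalar_leader` is `model_a`, while `per_axis_leader` names `model_b` on the other four axes, which is the kind of ranking reversal a single score cannot show.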

The most dangerous thing a pass rate hides is *where* the failures live. Aggregate accuracy can look strong while errors concentrate in the rare, high-harm cases (medical triage, legal interpretation, financial planning), where fluent, confident wrong answers slip through because the average drowns them out (see "Why do confident wrong answers hide in standard accuracy metrics?"). A pass rate treats a confident error and an honest miss as the same lost point, so it cannot tell you that the misses cluster precisely where they cost the most.
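A small sketch of how a strong average can coexist with a weak high-harm slice. The slice names and counts are hypothetical, chosen only so the arithmetic is easy to follow.

```python
from collections import defaultdict

# Hypothetical evaluation records as (slice_name, passed) pairs.
# 95 routine tasks and 5 rare high-harm tasks; all counts are invented.
records = (
    [("routine", True)] * 94
    + [("routine", False)] * 1
    + [("high_harm", True)] * 1
    + [("high_harm", False)] * 4
)


def pass_rate(rows):
    """Fraction of rows whose task was passed."""
    return sum(1 for _, passed in rows if passed) / len(rows)


def pass_rate_by_slice(rows):
    """Group rows by slice name and compute a pass rate for each group."""
    buckets = defaultdict(list)
    for slice_name, passed in rows:
        buckets[slice_name].append((slice_name, passed))
    return {name: pass_rate(bucket) for name, bucket in buckets.items()}


overall = pass_rate(records)             # 95 passes out of 100 tasks
by_slice = pass_rate_by_slice(records)   # high_harm: 1 pass out of 5 tasks
```

Here the overall rate is 95% while the high-harm slice passes only one task in five. Reporting per-slice rates next to the aggregate is one simple way to surface that concentration; it does not by itself separate confident errors from honest misses.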

A pass rate also hides that different skills scale at different rates. When performance is decomposed into distinct competencies, reasoning and knowledge skills keep improving with model size while metacognition saturates early (around 7B parameters) and logical efficiency plateaus (around 30B). Two models with the same aggregate score can therefore be strong in opposite places, and a distilled model can imitate surface style while failing at substance (see "Do all AI skills improve equally as models scale?"). Worse, identical metrics can sit on top of fundamentally different internal organization: a model can have every linearly decodable feature a task needs yet carry a fractured representation that shatters under perturbation or distribution shift, a weakness invisible to standard accuracy (see "Can models be smart without organized internal structure?").

The number also hides *how* the answer was reached. Two correct traces are not equal: failed-step fraction (the share of reasoning steps spent in abandoned branches) predicts correctness better than chain-of-thought length or review ratio, and those dead branches linger in context and bias what comes next (see "Does failed-step fraction predict reasoning quality better?"). In the same spirit, step-level confidence catches reasoning breakdowns that global averaging masks, so a clean final score can sit atop a process that was quietly broken (see "Does step-level confidence outperform global averaging for trace filtering?").
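The two process-level signals can be sketched as follows. This is an illustrative reading of the definitions in the text, not the cited notes' implementation: the trace, its confidence values, the `abandoned` flags, and the window size of 2 are all assumptions, and how a step gets labelled as abandoned is left outside the sketch.

```python
from statistics import mean

# Hypothetical reasoning trace. Each step has a confidence in [0, 1] and a
# flag marking whether it sits in a branch the model later abandoned.
trace = [
    {"confidence": 0.9, "abandoned": False},
    {"confidence": 0.9, "abandoned": False},
    {"confidence": 0.2, "abandoned": True},
    {"confidence": 0.3, "abandoned": True},
    {"confidence": 0.9, "abandoned": False},
    {"confidence": 0.9, "abandoned": False},
]


def failed_step_fraction(steps):
    """Share of steps that belong to abandoned branches."""
    if not steps:
        return 0.0
    return sum(1 for step in steps if step["abandoned"]) / len(steps)


def global_confidence(steps):
    """Mean confidence over the whole (non-empty) trace."""
    return mean(step["confidence"] for step in steps)


def lowest_window_confidence(steps, window=2):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    values = [step["confidence"] for step in steps]
    if len(values) <= window:
        return mean(values)
    return min(
        mean(values[i:i + window]) for i in range(len(values) - window + 1)
    )
```

On this toy trace a third of the steps are in abandoned branches, the global mean confidence stays near 0.68, and the lowest two-step window drops to about 0.25. The local view exposes the weak stretch that the average smooths over.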

Finally, the aggregate hides the boundary of what it can even measure. Auto-gradable benchmarks both overstate and understate ability by privileging precisely specified tasks, and only open-world evaluation of messy, long-horizon work, with cost reported, corrects the distortion (see "Do automated benchmarks hide what frontier AI systems can really do?"). A frontier exam can discriminate among models today yet say nothing about autonomous research or open-ended problem-solving (see "Can frontier exams really measure cutting-edge AI capability?"). The throughline: a pass rate answers "how often," but deployment turns on where the model fails, how it got there, whether the inside is sound, and whether the test could see the capability at all.


Sources (8 notes)

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

Why do confident wrong answers hide in standard accuracy metrics?

Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.

Do all AI skills improve equally as models scale?

FLASK's 12-skill decomposition reveals metacognition saturates at 7B parameters while logical efficiency plateaus at 30B, but reasoning and knowledge skills improve continuously. Open-source models successfully imitate surface-level style but fail at reasoning—confirming that distillation copies form not substance.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does failed-step fraction predict reasoning quality better?

Across 10 reasoning models, the fraction of steps in abandoned branches consistently predicts correctness better than CoT length or review ratio. Failed branches persist in context and bias subsequent reasoning, a phenomenon confirmed through correlation, reranking, and direct causal editing.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, indicating that trace quality matters more than quantity.

Do automated benchmarks hide what frontier AI systems can really do?

Automated benchmarks both overstate and understate capability by privileging precisely-specified, auto-gradable tasks. Open-world evaluations of long-horizon messy tasks through qualitative log analysis—with cost explicitly reported—correct these distortions and catch emerging capabilities earlier.

Can frontier exams really measure cutting-edge AI capability?

Humanity's Last Exam uses 3,000 expert-designed questions to expose capability gaps where MMLU saturates, showing real discrimination. Strong performance on an expert exam, however, would not indicate the autonomous research or open-world problem-solving ability that matters for deployment.

Next inquiring lines