Why does enlarging the evaluation unit reintroduce comparability problems?
This explores why moving from scoring single answers to scoring whole interaction trajectories (agent runs, multi-step conversations) brings back the very comparison problems benchmarks were supposed to settle, rather than escaping them.
The clearest statement in the corpus is that interactive evaluation doesn't dissolve the hard parts of measurement; it relocates them into a higher-dimensional space. Comparability, reproducibility, and the mapping from evidence to judgment don't disappear when you grade a trajectory instead of a single answer; they reappear at the trajectory level, now harder to pin down because there are more moving parts and no shared protocol for scoring them (Do interactive evaluations actually solve the benchmark comparison problem?). A single answer is comparable because everyone is grading the same small thing. Enlarge the unit and you reintroduce the question of *which* part of the trajectory you're even comparing.
Why the enlargement specifically breaks comparability becomes clearer once you look at how the corpus characterizes what lives inside a longer unit. Reasoning quality turns out to depend less on overall length and more on internal structure: the fraction of steps spent in abandoned, failed branches predicts correctness better than trace length or review ratio, partly because those dead branches linger in context and bias what comes after (Does failed-step fraction predict reasoning quality better?). A bigger unit is not just a longer version of a small one; it contains internal failures, detours, and dependencies that a final-answer score collapses into a single bit. Two trajectories that reach the same answer can be wildly different objects, so comparing them "as units" smuggles in a choice about what counts.
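A minimal sketch of that last point, in Python. It assumes each step already carries a label saying whether it belongs to a branch that was later abandoned; how such labels are obtained is not specified here, and the `Step`, `Trajectory`, and `failed_step_fraction` names are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    text: str
    abandoned: bool  # assumed label: True if the step sits in a branch that was later dropped


@dataclass
class Trajectory:
    steps: list[Step]
    final_answer: str


def failed_step_fraction(traj: Trajectory) -> float:
    """Share of steps spent in abandoned branches (0.0 for an empty trajectory)."""
    if not traj.steps:
        return 0.0
    return sum(step.abandoned for step in traj.steps) / len(traj.steps)


# Two toy trajectories (made-up data) that end at the same answer.
direct = Trajectory(
    steps=[Step("set up equation", False), Step("solve", False)],
    final_answer="42",
)
wandering = Trajectory(
    steps=[
        Step("try substitution", True),
        Step("hit a dead end", True),
        Step("try guessing", True),
        Step("set up equation", False),
        Step("solve", False),
    ],
    final_answer="42",
)

same_answer = direct.final_answer == wandering.final_answer
fsf_direct = failed_step_fraction(direct)        # 0.0
fsf_wandering = failed_step_fraction(wandering)  # 0.6

print(same_answer, fsf_direct, fsf_wandering)
```

A final-answer score treats the two runs as identical, while the internal-structure measure separates them. Which of the two views counts as "the score" is exactly the choice the trajectory unit forces on the evaluator.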
The same problem shows up in what length itself means once you scale up. Trace length reflects how close a problem sits to the training distribution, not its actual difficulty: the correlation between length and difficulty holds in-distribution and decouples entirely out-of-distribution (Does longer reasoning actually mean harder problems?). So a metric that seems comparable across a benchmark (longer = harder) silently stops being comparable the moment instances vary in familiarity. Relatedly, failures cluster at instance-level novelty rather than at clean task-complexity thresholds (Do language models fail at reasoning due to complexity or novelty?), which means the larger unit's score depends on a hidden variable, namely how familiar this exact instance is, that a per-answer benchmark could hold roughly constant but a trajectory benchmark cannot.
There's also a finer-grained answer hiding in the corpus: people have already found that the fix for noisy aggregate scores is to *shrink* the unit, not grow it. Step-level confidence filtering beats global confidence averaging because local signals catch reasoning breakdowns that whole-trace averaging masks (Does step-level confidence outperform global averaging for trace filtering?), and step-level critique during training preserves solution diversity that coarse outcome rewards would wash out (Do critique models improve diversity during training itself?). Read together with the relocation finding, the lesson is symmetrical: enlarging the evaluation unit reintroduces comparability problems for the same reason coarse aggregation hides reasoning failures. Averaging over a bigger, heterogeneous span destroys the apples-to-apples alignment that made small-unit scoring trustworthy.
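The masking effect can be illustrated with a small Python sketch. It assumes each trace comes with one confidence value per step, and it uses the lowest sliding-window mean as the local signal; that particular rule, the window size, the threshold, and the function names are illustrative assumptions, not the method from the cited note.

```python
def global_confidence(step_conf: list[float]) -> float:
    """Mean confidence over the whole trace."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf: list[float], window: int = 2) -> float:
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return global_confidence(step_conf)
    window_means = [
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    ]
    return min(window_means)


# Made-up confidences: one steady trace, one with a local breakdown in the middle.
steady = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
breakdown = [0.95, 0.95, 0.5, 0.5, 0.95, 0.95]

THRESHOLD = 0.7  # assumed cut-off for keeping a trace

for name, trace in [("steady", steady), ("breakdown", breakdown)]:
    keep_global = global_confidence(trace) >= THRESHOLD
    keep_local = lowest_window_confidence(trace) >= THRESHOLD
    print(name, "global keeps:", keep_global, "local keeps:", keep_local)
```

Both traces have a whole-trace mean of about 0.8, so the global filter keeps both. The local signal drops to 0.5 inside the breakdown, so only the steady trace survives the step-level filter.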
The thing you didn't know you wanted to know: the real cost isn't dimensionality, it's the missing protocol. The corpus's prescription is that trajectory evaluation needs shared design standards and explicit evidence-to-judgment rules to become interpretable again (Do interactive evaluations actually solve the benchmark comparison problem?). Comparability was never a free property of the metric; it was something the small fixed-answer format quietly provided for you. Scale the unit up and you have to rebuild it on purpose.
Sources (6 notes)
Interactive evaluation relocates core problems—comparability, reproducibility, evidence-to-judgment mapping—into higher-dimensional space rather than solving them. The field needs design protocols and shared standards, not format adoption, to make trajectory scoring interpretable.
Across 10 reasoning models, the fraction of steps in abandoned branches consistently predicts correctness better than CoT length or review ratio. Failed branches persist in context and bias subsequent reasoning, a phenomenon confirmed through correlation, reranking, and direct causal editing.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.