How does evaluation setting affect measured reasoning capabilities in language models?
This explores how the way we test reasoning — text-only vs. tool-enabled, short vs. padded inputs, familiar vs. novel instances — can change what looks like a model's 'reasoning ability,' often more than the model itself does.
The corpus is unusually pointed on this: several notes argue that well-known 'reasoning limits' are really measurement artifacts. The cleanest case is the so-called reasoning cliff. "Does the reasoning cliff depend on how we test models?" shows that models which collapse catastrophically on text-only benchmarks keep scaling when given tools, meaning text-only tests systematically *underestimate* real capability. "Are reasoning model collapses really failures of reasoning?" sharpens the why: the bottleneck is often execution bandwidth, since a model may know the algorithm but be unable to hand-simulate it for thousands of steps in plain text. "Do tools actually expand what language models can reason about?" adds a formal proof that tools unlock strategies that are impossible or prohibitively verbose in text. So the same model can sit on either side of a 'cliff' depending purely on whether the evaluator allowed tool access.
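To make the execution-bandwidth point concrete, here is a minimal sketch. Tower of Hanoi is used purely as a stand-in task (the notes summarized here do not name a specific puzzle), and the function names are illustrative assumptions. The algorithm fits in a few lines, yet the transcript a text-only solver would have to write out grows as 2^n - 1 moves.

```python
def hanoi_moves(n, src="A", dst="C", via="B"):
    """Yield every move needed to shift n disks from src to dst."""
    if n == 0:
        return
    # Move the top n-1 disks out of the way, move the largest, then restack.
    yield from hanoi_moves(n - 1, src, via, dst)
    yield (src, dst)
    yield from hanoi_moves(n - 1, via, dst, src)


def transcript_length(n):
    """Number of moves a text-only solver must write out by hand."""
    return 2 ** n - 1


if __name__ == "__main__":
    # Illustrative sizes only; the step count doubles with each extra disk.
    for n in (3, 10, 15):
        print(n, transcript_length(n))
```

With a tool, the model only has to emit the short program; without one, it has to emit every move itself, which is where long hand-simulation can fail even though the procedure is known.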
Input framing matters just as much as tool access. "Does reasoning ability actually degrade with longer inputs?" finds accuracy dropping from 92% to 68%, a 24-point loss, with only 3,000 tokens of irrelevant padding. That is far below the context window, and the degradation is uncorrelated with language-modeling skill and not fixed by chain-of-thought prompting. The reasoning task didn't change; only the amount of surrounding text did. Pair this with "Why do language models ignore information in their context?", which shows models ignoring in-context information when training priors are strong, and you get a picture where benchmark scores depend heavily on how the problem is *packaged*, not just what it asks.
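A padding check of this kind can be sketched in a few lines. This is a rough illustration, not the benchmark's actual protocol: `model` is a hypothetical callable that maps a prompt string to an answer string, word counts stand in for token counts, and the filler is whatever irrelevant sentences you supply.

```python
def pad_prompt(question, filler_sentences, target_words):
    """Surround a question with irrelevant filler of roughly target_words words."""
    padding = []
    count = 0
    i = 0
    while filler_sentences and count < target_words:
        sentence = filler_sentences[i % len(filler_sentences)]
        padding.append(sentence)
        count += len(sentence.split())
        i += 1
    # Place the question in the middle so filler appears before and after it.
    half = len(padding) // 2
    return " ".join(padding[:half] + [question] + padding[half:])


def accuracy_by_padding(model, items, filler_sentences, lengths):
    """Return {padding_length: accuracy} for the same items at each length.

    items is a list of (question, expected_answer) pairs.
    """
    results = {}
    for n in lengths:
        correct = sum(
            model(pad_prompt(q, filler_sentences, n)).strip() == a
            for q, a in items
        )
        results[n] = correct / len(items)
    return results
```

Because the questions are held fixed across lengths, any change in accuracy is attributable to the packaging rather than to the task.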
A second theme is that high scores can come from the wrong source entirely. "Are models actually reasoning about constraints or just defaulting conservatively?" is the sharpest example: twelve of fourteen models did *worse* when constraints were removed, by up to 38.5 percentage points, meaning they had been scoring well by defaulting to the harder option rather than actually evaluating the constraint. Remove the crutch the eval accidentally provided, and the apparent reasoning collapses. "Do language models fail at reasoning due to complexity or novelty?" points the same direction: models don't break at a complexity threshold but at an *unfamiliarity* boundary, so a benchmark drawn from instances near the training distribution will overstate how well the model generalizes.
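The constraint-removal comparison amounts to a simple ablation. The sketch below assumes you already have per-model accuracy (in percent) with and without the constraint; the function names and report fields are hypothetical, and the numbers in the demo are placeholders, not results from the notes.

```python
def constraint_ablation(scores):
    """Flag models whose accuracy falls once the constraint is removed.

    scores maps model name -> (acc_with_constraint, acc_without_constraint).
    A negative delta suggests the model was defaulting to the harder option
    rather than evaluating the constraint.
    """
    report = {}
    for name, (with_c, without_c) in scores.items():
        delta = without_c - with_c
        report[name] = {"delta_pp": round(delta, 1), "suspect_default": delta < 0}
    return report


def count_suspects(report):
    """Number of models flagged as likely defaulting."""
    return sum(1 for r in report.values() if r["suspect_default"])


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    demo = {"model_a": (80.0, 60.0), "model_b": (70.0, 72.0)}
    print(constraint_ablation(demo))
```

A model that truly evaluates the constraint should do at least as well when the easier case becomes correct, so a drop is the signal worth investigating.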
There's even a layer where the evaluation format hides reasoning that *did* happen. "Do transformers hide reasoning before producing filler tokens?" shows models trained with hidden chain-of-thought tokens computing correct answers in early layers, then suppressing them in the final layers to emit format-compliant filler, so a surface read of the output undersells the internal computation. The recurring lesson across all of these: a reasoning score is a joint measurement of the model *and* the harness. Tool access, input padding, constraint structure, instance familiarity, and output format each move the number, sometimes by tens of points.
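One way to see the "joint measurement" idea is to hold the model fixed and sweep the harness. The sketch below is an assumption-laden illustration: `evaluate` is a hypothetical callable that takes harness settings as keyword arguments and returns a score, and the setting names are whatever your own harness exposes.

```python
from itertools import product


def harness_sweep(evaluate, settings):
    """Score one model under every combination of harness settings.

    settings maps a setting name to the list of options to try.
    Returns a list of (config_dict, score) pairs.
    """
    names = list(settings)
    results = []
    for combo in product(*(settings[n] for n in names)):
        config = dict(zip(names, combo))
        results.append((config, evaluate(**config)))
    return results


def score_spread(results):
    """Gap between the best and worst score across harness configurations."""
    scores = [score for _, score in results]
    return max(scores) - min(scores)
```

A large spread for a single model is a direct estimate of how much of the reported number belongs to the harness rather than to the model.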
If you want to go further, the corpus also has the constructive flip side: designing evals and training signals that measure reasoning more honestly. "Can model confidence work as a reward signal for reasoning?" uses the model's own answer confidence as a training signal, "Do high-entropy tokens drive reasoning model improvements?" shows that the real reasoning 'decisions' live in a roughly 20% slice of high-entropy tokens, and "Can small models match large models on function calling?" shows small models reaching high accuracy once training explicitly targets rigid output-format failures. The headline you didn't know you wanted: before asking *how well a model reasons*, ask what your test is secretly rewarding.
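The high-entropy-token idea can be illustrated with a small selection routine. This is a sketch under stated assumptions, not the note's training procedure: it takes one next-token probability distribution per position, and the 0.2 default simply mirrors the roughly 20% figure quoted above.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_mask(distributions, fraction=0.2):
    """Mark the top `fraction` of positions by entropy as 'forking' tokens."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]
```

A mask like this could be used to restrict analysis, or a training update, to the positions where the model was genuinely undecided.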
Sources 11 notes
Language models show catastrophic failure in text-only reasoning benchmarks but maintain scaling when given tool access. The cliff reflects execution constraints, not reasoning capability, making text-only evaluations systematically underestimate real-world performance.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Formal proof shows tool-integrated reasoning enables strategies impossible or prohibitively verbose in text alone, expanding both empirical and feasible support. The advantage spans abstract reasoning, not just arithmetic, and Advantage Shaping Policy Optimization stabilizes training without reward distortion.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.