INQUIRING LINE

How does evaluation format change what we measure about model reasoning?

This explores how the *setup* of an evaluation — the output format we allow, the length budget, whether tools are on the table, and who's doing the grading — quietly decides what we're actually measuring when we say a model 'reasons.'


How we frame an evaluation changes what 'reasoning' we end up scoring, and the corpus suggests the format often measures something other than what we think. The starkest case: when a model is confined to text-only generation and we watch it fail a long multi-step problem, we tend to call that a reasoning collapse. But give the same model a tool to execute the steps and the 'collapse' vanishes: it knew the algorithm all along and was bottlenecked on procedural execution bandwidth, not thinking (Are reasoning model collapses really failures of reasoning?). The format (text-only vs. tool-enabled) didn't just change the score; it changed what the score was a measurement *of*.
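A minimal sketch of why a text-only format can look like a reasoning cliff. It assumes, purely for illustration, that each written-out step is copied correctly with a fixed, independent probability; that error model and the function names are hypothetical, not taken from the cited note.

```python
def text_only_success(per_step_accuracy: float, num_steps: int) -> float:
    """Chance of writing out every step correctly.

    Assumption (not from the source): step errors are independent and
    equally likely, so success decays geometrically with problem length.
    """
    return per_step_accuracy ** num_steps


def tool_enabled_success(knows_algorithm: bool) -> float:
    """With a tool executing the steps, only knowing the algorithm matters."""
    return 1.0 if knows_algorithm else 0.0


if __name__ == "__main__":
    # Same "model" (99.9% reliable per step, algorithm known), two formats.
    for num_steps in (10, 100, 1_000, 10_000):
        text_score = text_only_success(0.999, num_steps)
        tool_score = tool_enabled_success(True)
        print(f"{num_steps:>6} steps  text-only={text_score:.4f}  tool={tool_score:.1f}")
```

Under these assumptions the text-only score falls toward zero as the step count grows while the tool-enabled score stays flat, even though what the model knows is held constant across both formats.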

A second thread shows that the visible form of reasoning can come apart from its substance. Prompting with chain-of-thought exemplars built from logically *invalid* steps performs nearly as well as prompting with valid ones on BIG-Bench Hard: the model is picking up the shape of reasoning, not genuine inference (Does logical validity actually drive chain-of-thought gains?). In the same spirit, a 1.5B model given only LoRA post-training matches larger full-parameter RL-trained models, implying that much of what RL adds, and much of what reasoning benchmarks reward, is the organization of the output rather than new knowledge (Can small models reason well by just learning output format?). So an eval that scores final correctness on formatted traces may be measuring presentation as much as cognition.
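To make the gap between presentation and substance concrete, here is a small sketch contrasting an outcome-only grader with a step-checking grader on toy arithmetic traces. The trace format (`a op b = c` per line) and both scoring functions are assumptions made up for this illustration; they are not the protocol of any cited study.

```python
import re
from typing import List

# Hypothetical step format: "<int> <op> <int> = <int>", with op in + - *
STEP_PATTERN = re.compile(r"^\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*$")


def step_is_valid(step: str) -> bool:
    """True only if the step parses and its arithmetic is actually correct."""
    match = STEP_PATTERN.match(step)
    if match is None:
        return False
    left, op, right, claimed = int(match[1]), match[2], int(match[3]), int(match[4])
    actual = {"+": left + right, "-": left - right, "*": left * right}[op]
    return actual == claimed


def outcome_score(final_answer: int, gold_answer: int) -> float:
    """Outcome-only grading: ignores the trace entirely."""
    return 1.0 if final_answer == gold_answer else 0.0


def process_score(steps: List[str]) -> float:
    """Fraction of steps whose inference is valid."""
    if not steps:
        return 0.0
    return sum(step_is_valid(step) for step in steps) / len(steps)
```

Two traces that both end in the right number receive the same `outcome_score` even when every intermediate step in one of them is wrong; only `process_score` separates them. Which grader you pick decides whether the eval can see the difference at all.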

Length is its own format knob, and it bends the curve in counterintuitive ways. Accuracy doesn't rise monotonically with more thinking tokens: push from ~1,100 to ~16K tokens and accuracy can fall from 87.3% to 70.3% as models overthink easy items and underthink hard ones (Does more thinking time always improve reasoning accuracy?). Accuracy traces an inverted U over chain length, and the peak moves later as tasks get harder and earlier as models get more capable, so a fixed token budget rewards different models differently (Why does chain of thought accuracy eventually decline with length?). Some of that 'wasted' length is structural: models wander into invalid branches and abandon promising paths prematurely, and a decoding-only penalty on thought-switching recovers accuracy without any retraining (Why do reasoning models abandon promising solution paths?; Do reasoning models switch between ideas too frequently?). Whether you read these as reasoning failures or as artifacts of how the generation format lets the model meander depends on the evaluation frame.
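The decoding-only penalty on thought-switching can be sketched as a logit adjustment applied before each token is chosen. This is a simplified illustration, not the published TIP procedure: the parameter names (`penalty`, `hold_steps`), their default values, and the idea of listing switch tokens as plain strings are all assumptions.

```python
from typing import Dict, Set


def penalize_switch_tokens(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    steps_since_switch: int,
    penalty: float = 3.0,   # hypothetical strength
    hold_steps: int = 50,   # hypothetical window during which switching is discouraged
) -> Dict[str, float]:
    """Return a copy of `logits` with thought-switch tokens pushed down.

    The penalty applies only while the current thought is younger than
    `hold_steps` tokens; after that the model may switch freely.
    """
    if steps_since_switch >= hold_steps:
        return dict(logits)
    return {
        token: (score - penalty if token in switch_tokens else score)
        for token, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Pick the highest-scoring token (greedy decoding)."""
    return max(logits, key=logits.get)
```

Because the adjustment happens only at decoding time, the model's weights are untouched; the same checkpoint yields different measured accuracy depending on whether the generation format lets it switch thoughts early.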

This is why some researchers stop measuring at the output surface and look inward instead. The deep-thinking ratio tracks the proportion of tokens whose predictions are significantly revised across the model's layers, giving a signal of *genuine* reasoning effort that correlates with accuracy without trusting the visible trace (Can we measure how deeply a model actually reasons?). Others note that the same 'thinking mode' that induces self-doubt in a vanilla model becomes productive gap analysis after RL training: same mechanism, opposite measured value, depending on what shaped it (Does extended thinking help or hurt model reasoning?).
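A rough sketch of a deep-thinking-style ratio, assuming you already have, for every generated token, the top-1 prediction read out at each layer (for example from a logit-lens style probe). The criterion used here, "the prediction only settles on its final value in the later part of the stack", and the `late_fraction` threshold are stand-ins chosen for illustration; the cited work's exact definition of "significant revision" is not reproduced here.

```python
from typing import List, Sequence


def settle_layer(layer_predictions: Sequence[int]) -> int:
    """Earliest layer index from which the prediction stays equal to the final one."""
    final_prediction = layer_predictions[-1]
    index = len(layer_predictions) - 1
    while index > 0 and layer_predictions[index - 1] == final_prediction:
        index -= 1
    return index


def deep_thinking_ratio(
    per_token_layer_predictions: List[Sequence[int]],
    late_fraction: float = 0.5,  # hypothetical cutoff for "settled late"
) -> float:
    """Share of tokens whose prediction is still being revised late in the stack."""
    if not per_token_layer_predictions:
        return 0.0
    deep_tokens = 0
    for layer_predictions in per_token_layer_predictions:
        last_index = len(layer_predictions) - 1
        settled_at = settle_layer(layer_predictions)
        # A token that never changed (settled_at == 0) is never counted as deep.
        if settled_at > 0 and settled_at >= late_fraction * last_index:
            deep_tokens += 1
    return deep_tokens / len(per_token_layer_predictions)
```

The point of a signal like this is that it never reads the visible chain of thought, so it cannot be inflated by a trace that merely looks like reasoning.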

The twist worth carrying away: the *judge's* format changes the measurement too, not just the model's. Reward models that are allowed to reason step-by-step before scoring, rather than emit a single classification, judge more accurately and scale with test-time compute (Can reward models benefit from reasoning before scoring?), and generative judges that meta-reason about each reasoning step beat discriminative classifiers with orders of magnitude less data (Can judges that reason about reasoning outperform classifier rewards?). You can even use the model's own answer-span confidence as the grading signal, which strengthens reasoning while fixing calibration (Can model confidence work as a reward signal for reasoning?). So 'evaluation format' isn't only the box the model writes in; it's also the lens the grader looks through, and changing either one changes the thing you thought you were measuring.
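As one illustration of confidence acting as the grader, the sketch below ranks sampled traces by how confident the model was on the answer span and turns the ranking into synthetic preference pairs. The `Trace` type, the geometric-mean confidence, the best-with-worst pairing scheme and the `min_gap` threshold are all assumptions for this example; the cited RLSF note does not specify these details.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    text: str
    answer_logprobs: List[float]  # log-probabilities of the answer-span tokens only


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean token probability over the answer span (0.0 if empty)."""
    if not trace.answer_logprobs:
        return 0.0
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pairs(traces: List[Trace], min_gap: float = 0.1) -> List[Tuple[Trace, Trace]]:
    """Pair the most confident traces with the least confident ones.

    Returns (chosen, rejected) tuples, skipping pairs whose confidence
    gap is below `min_gap` because they carry little preference signal.
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs: List[Tuple[Trace, Trace]] = []
    top, bottom = 0, len(ranked) - 1
    while top < bottom:
        chosen, rejected = ranked[top], ranked[bottom]
        if answer_confidence(chosen) - answer_confidence(rejected) >= min_gap:
            pairs.append((chosen, rejected))
        top += 1
        bottom -= 1
    return pairs
```

No human label or external verifier appears anywhere in this loop, which is exactly why the choice of grading signal is part of the evaluation format rather than a neutral detail.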


Sources (12 notes)

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can small models reason well by just learning output format?

A 1.5B parameter model with LoRA-only post-training matched larger full-parameter RL models on reasoning tasks, suggesting RL teaches output format organization rather than new factual knowledge. This efficiency indicates reasoning and knowledge storage are separable capabilities.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning evaluation researcher. The question remains open: **Does evaluation format measure reasoning itself, or artifacts of how models generate and are judged?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable constraints to re-test.

• Text-only eval format conflates reasoning with execution bandwidth: tool-enabled versions of the same task suggest the 'collapse' was a procedural bottleneck, not a failure of cognition (2025).
• Logically invalid chain-of-thought traces score nearly as well as valid ones; models learn the *shape* of reasoning, not inference substance (2023).
• Accuracy follows an inverted-U curve over chain-of-thought length, peaking at ~1,100–4K tokens; beyond that, accuracy degrades (87% → 70% at 16K); the optimal length shifts with task difficulty and model capability (2025).
• Deep-thinking ratio (layer-wise token prediction revision) correlates with accuracy and bypasses reliance on visible trace fidelity (2026).
• Reward models permitted to reason step-by-step before scoring judge more accurately and scale with test-time compute; generative stepwise judges outperform discriminative classifiers (2025).

Anchor papers (verify; mind their dates):
• arXiv:2307.10573 (2023) — Invalid Logic, Equivalent Gains
• arXiv:2502.07266 (2025) — When More is Less: CoT Length
• arXiv:2602.13517 (2026) — Deep-Thinking Ratio
• arXiv:2508.19229 (2025) — StepWiser: Generative Judges

Your task:

(1) **RE-TEST EACH CONSTRAINT.** For every finding above, ask: Have newer models (o3, GPT-4.5, Gemini 3), decoding strategies (speculative, branching, adaptive budget), or training methods (DPO, outcome-supervised RL) since RELAXED or OVERTURNED it? Separate the durable question (format sensitivity likely persists) from perishable claims (e.g., does the inverted-U still hold?). Cite what resolved it.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months: any papers showing format *doesn't* substantially change measured reasoning, or proving reasoning is format-invariant?

(3) **Propose 2 research questions** that assume the regime has shifted: (a) If format effects remain but are now understood, how do we design *format-robust* reasoning evals? (b) Can a single unified internal signal (like deep-thinking ratio) replace format-dependent surface metrics?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
