INQUIRING LINE

Why do benchmark scores rise while reasoning quality declines?

This line of inquiry explores why a model can score higher on benchmarks while the actual quality of its reasoning gets worse. The corpus shows the gap comes from at least three distinct mechanisms: contaminated tests, training that rewards shortcuts while standard metrics grade only the final answer, and score-boosting knobs such as longer thinking whose effects are non-monotonic.


The corpus traces the gap to three separate failure points, and they compound. The first is that the benchmark itself may be measuring memory, not thought. Qwen2.5-Math-7B can reconstruct over half of MATH-500 (54.6%) from partial prompts yet scores 0% on the clean, post-release LiveMathBench, so gains attributed to 'reasoning' are partly the model recalling answers it already saw (Does RLVR success on math benchmarks reflect genuine reasoning improvement?). Importantly, this doesn't mean training never works: behavioral activation of genuine reasoning and benchmark improvement are separable phenomena that can occur side by side, which is exactly why a rising score is ambiguous evidence (Can genuine reasoning activation coexist with contaminated benchmarks?).
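
One way to make the memorization check concrete is a partial-prompt probe: show the model the start of a benchmark item and see whether it reproduces the rest. The sketch below is a minimal illustration, not the cited study's procedure. The name `reconstruction_rate`, the `complete` callable (a stand-in for whatever model API is in use), the 0.6 prefix fraction, and the exact-match criterion are all assumptions made for this example, and the two problems are invented.

```python
from typing import Callable, Iterable


def reconstruction_rate(
    problems: Iterable[str],
    complete: Callable[[str], str],
    prefix_fraction: float = 0.6,  # assumed split point, not taken from the source
) -> float:
    """Fraction of problems whose held-back tail the model reproduces verbatim."""
    hits = 0
    scored = 0
    for text in problems:
        words = text.split()
        if len(words) < 2:
            continue  # nothing to hold back, so the item cannot be scored
        cut = min(len(words) - 1, max(1, int(len(words) * prefix_fraction)))
        prefix = " ".join(words[:cut])
        tail = " ".join(words[cut:])
        # Normalize whitespace so formatting differences do not hide a match.
        completion = " ".join(complete(prefix).split())
        scored += 1
        if completion.startswith(tail):
            hits += 1
    if scored == 0:
        raise ValueError("no problem was long enough to score")
    return hits / scored


# Hypothetical stand-in for a model that has memorized exactly one item.
MEMORIZED = ("Find the sum of the first ten positive integers and report it",)


def stub_complete(prefix: str) -> str:
    for item in MEMORIZED:
        if item.startswith(prefix):
            return item[len(prefix):]
    return "I do not know"


problems = [
    "Find the sum of the first ten positive integers and report it",
    "Compute the area of a circle with radius three in square units",
]
rate = reconstruction_rate(problems, stub_complete)
```

A high rate on a pre-release benchmark combined with a near-zero score on post-release items is the pattern described above. The rate by itself says nothing about whether the model can also reason.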

The second mechanism is that the training that lifts scores can actively hollow out reasoning. Supervised fine-tuning raises final-answer accuracy while cutting the information content of the reasoning steps by ~39%: the model arrives at correct answers through post-hoc rationalization and pattern-matching shortcuts rather than genuine inference, and becomes less auditable in the process (Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?). The deeper reason this is invisible is methodological: standard metrics grade only the final answer, so a hollowed-out trace costs nothing. Grading traces is not an automatic fix, though. LR²Bench, which scores only final answers against deterministic ground truth, found a 20% ceiling that trace-based scoring would inflate by counting stylistic 'reasoning mimicry' as the real thing (Should reasoning benchmarks score final answers or reasoning traces?). And mimicry is cheap: on BIG-Bench Hard, logically invalid chain-of-thought exemplars perform about as well as valid ones, because the model is learning the *form* of reasoning, not the inference itself (Does logical validity actually drive chain-of-thought gains?).

The third mechanism is that the knobs we turn to push scores up have non-monotonic effects: past a point, more is not better. Increasing thinking tokens from ~1,100 to ~16K dropped accuracy from 87% to 70%, because models overthink easy problems and underthink hard ones (Does more thinking time always improve reasoning accuracy?). Optimal chain-of-thought length follows an inverted U, and tellingly, more capable models and RL training naturally gravitate toward *shorter* chains. Simplicity emerges from good reward signals, so a model padding its reasoning to look thorough is often a worse model, not a better one (Why does chain of thought accuracy eventually decline with length?). Errors also snowball step by step regardless of which reasoning framework you use, so the appearance of elaborate deliberation doesn't buy reliability (Does the choice of reasoning framework actually matter for test-time performance?).
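
The arithmetic behind "more is not better" can be checked with a small budget sweep. The helper below is a hedged sketch: `peak_and_regressions` is a hypothetical name, and the only data points used are the two reported in the note above (approximately 1,100 and 16K thinking tokens), so it shows a decline past the best measured budget rather than the full inverted-U curve.

```python
from typing import List, Tuple


def peak_and_regressions(
    measurements: List[Tuple[int, float]],
) -> Tuple[int, List[Tuple[int, float]]]:
    """Return the best-scoring budget and every larger budget that scored lower.

    measurements: (thinking_tokens, accuracy_percent) pairs.
    Each regression is (budget, accuracy points lost relative to the peak).
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    ordered = sorted(measurements)
    # On ties, max() keeps the first maximum, which is the smallest budget.
    peak_budget, peak_acc = max(ordered, key=lambda pair: pair[1])
    regressions = [
        (budget, round(peak_acc - acc, 1))
        for budget, acc in ordered
        if budget > peak_budget and acc < peak_acc
    ]
    return peak_budget, regressions


# The two approximate points reported in the source note.
sweep = [(1100, 87.3), (16000, 70.3)]
best_budget, losses = peak_and_regressions(sweep)
```

With a denser sweep, the same helper would locate the top of the inverted U for a given model and task. The source indicates that this optimum rises with task difficulty and falls with model capability.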

What ties this together, and the part you might not expect, is *where* the real reasoning signal actually lives. Only about 20% of tokens are high-entropy 'forking points' where the model makes a genuine decision; training on just those matches or exceeds full-gradient training (Do high-entropy tokens drive reasoning model improvements?). Most of the visible reasoning trace is filler around a few load-bearing moments, which is why a longer, more impressive-looking trace can coexist with worse decisions at the points that matter. And the fragility is real: reasoning accuracy falls from 92% to 68% with just 3,000 tokens of irrelevant padding, far below the context limit, even with chain-of-thought prompting (reasoning-performance-degrades-with-input-length-even-far-below-context-length-l).
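
The idea of a 'forking point' can be made concrete with token-level entropy, $H = -\sum_i p_i \log p_i$, computed over the model's next-token distribution at each position. The sketch below is an assumption-laden illustration: `token_entropy` and `forking_mask` are hypothetical names, the logits are toy values, and ranking positions within a single sequence is a simplification. How the cited work thresholds entropy is not stated in the note.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of softmax(logits)."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract max for stability
    total = sum(exps)
    # Skip terms that underflowed to zero; they contribute nothing to H.
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0)


def forking_mask(
    per_token_logits: Sequence[Sequence[float]],
    top_fraction: float = 0.2,  # the ~20% figure from the note
) -> List[bool]:
    """Mark the highest-entropy positions as candidate forking points."""
    entropies = [token_entropy(logits) for logits in per_token_logits]
    keep_count = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:keep_count])
    return [i in keep for i in range(len(entropies))]


# Toy sequence: four confident positions and one genuinely uncertain one.
confident = [10.0, 0.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0, 0.0]
mask = forking_mask([confident, confident, uncertain, confident, confident])
```

In a training loop, a mask like this would gate which positions contribute to the gradient. That is one reading of "training on just those", offered here as an interpretation rather than the paper's exact recipe.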

The through-line: a benchmark score is a single number measuring the final answer, while reasoning quality lives in the steps, the decision points, and robustness to distraction. Optimize hard for the number and you can get all three forms of decay at once (memorized test items, shortcut-trained answers, and bloated traces), each of which lifts the score for reasons that have nothing to do with thinking better. If you want one provocative thread to pull, it's that the corpus questions whether 'content-independent' reasoning is even the right target, since humans and LLMs fail along the same content-sensitivity axis (Do language models fail reasoning tests that humans pass?).


Sources 12 notes

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher auditing whether a curated library's claims about reasoning-vs-benchmark decay still hold under newer models and methods. The question remains open: why do benchmark scores rise while reasoning quality declines?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. The library identified three compounding failure modes (the first three bullets) plus two supporting findings:
• Benchmark contamination: Qwen2.5-Math-7B reconstructs >50% of MATH-500 from partial prompts, scoring 0% on clean post-release tests (2025-07).
• SFT hollowing: supervised fine-tuning raises final-answer accuracy while cutting reasoning information content by ~39%; final-answer scoring against deterministic ground truth reveals a 20% ceiling that trace-based scoring would inflate by rewarding reasoning mimicry (2025-02, 2025-10).
• Non-monotonic scaling: reasoning accuracy drops from 87% to 70% when thinking tokens increase from ~1.1K to ~16K; inverted-U optimal CoT length emerges with stronger models preferring shorter chains (2025-02, 2025-06).
• High-entropy minority tokens: ~20% of tokens are decision-critical 'forking points'; training on those alone matches full training (2025-06).
• Padding fragility: reasoning accuracy falls 92%→68% with just 3K irrelevant tokens, far below context limits (2024-02).

Anchor papers (verify; mind their dates):
• arXiv:2507.10532 (2025-07) — data contamination and RLVR reliability
• arXiv:2502.07266 (2025-02) — inverted-U CoT length and model capability
• arXiv:2506.01939 (2025-06) — high-entropy token efficiency in RL
• arXiv:2510.18176 (2025-10) — local vs. global validity in math reasoning traces

Your task:
(1) RE-TEST EACH CONSTRAINT. For contamination, SFT hollowing, and scaling non-monotonicity: have newer evals (e.g., held-out test suites, process-based reward models, or post-training methods released in late 2025) relaxed or overturned these limits? Does trace-based scoring now dominate final-answer grading in major benchmarks? Separate the durable observation (benchmark design conflates memory with reasoning) from perishable technical limits (e.g., optimal CoT length may shift with new scaling laws).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: have any papers shown that longer reasoning or SFT can reliably improve *both* final accuracy and reasoning transparency? Name contradicting claims explicitly.
(3) Propose 2 research questions assuming the regime has moved: (a) If high-entropy token identification is now automated, can filtering+training on forking points alone scale to frontier models? (b) Do newer process-reward models or verifier-based scoring finally decouple benchmark inflation from reasoning degradation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines