INQUIRING LINE

How much does extended thinking actually improve model reasoning ability?

This explores whether giving a model more time and tokens to 'think' actually makes its reasoning better — and the corpus suggests the honest answer is 'less than you'd hope, and only under specific conditions.'


The corpus pushes hard against the intuitive 'more thinking = better answers' assumption. The clearest finding: the relationship between thinking length and accuracy is non-monotonic, not a steady improvement. One study watched accuracy fall from 87.3% to 70.3% as thinking tokens scaled from ~1,100 to ~16,000, with models overthinking easy problems and underthinking hard ones (Does more thinking time actually improve LLM reasoning?; Does more thinking time always improve reasoning accuracy?). Accuracy follows an inverted-U curve in chain length, peaking at an intermediate length. That optimal length grows with task difficulty but *shrinks* as the model gets more capable, and RL training naturally drifts toward shorter chains as models improve (Why does chain of thought accuracy eventually decline with length?).
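To make the inverted-U shape concrete, here is a minimal toy sketch. It is not the model used in any of the cited notes: the functional form, the parameter names (`difficulty`, `capability`, `step_noise`), and the default values are all illustrative assumptions. The idea is that splitting a task into more steps makes each step easier, but every extra step also carries a small chance of introducing an error.

```python
import math

def accuracy(n_steps, difficulty, capability, step_noise=0.02):
    """Toy accuracy of a chain with n_steps steps (illustrative assumption only)."""
    step_size = difficulty / n_steps
    # Larger steps fail more often; a more capable model tolerates larger steps.
    p_step_ok = math.exp(-(step_size ** 2) / capability)
    # Every extra step also carries a fixed chance of introducing an error.
    return ((1.0 - step_noise) * p_step_ok) ** n_steps

def best_length(difficulty, capability, max_steps=200):
    """Chain length (in steps) that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: accuracy(n, difficulty, capability))

if __name__ == "__main__":
    for difficulty, capability in [(4, 1), (8, 1), (4, 4)]:
        n = best_length(difficulty, capability)
        print(f"difficulty={difficulty} capability={capability} best_length={n}")
```

Under these assumptions the best length rises when `difficulty` rises and falls when `capability` rises, which is the qualitative pattern the note describes. The specific numbers mean nothing beyond that.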

The more unsettling thread is *why* extended thinking sometimes helps at all. One note argues the gains come not from better reasoning but from variance inflation: longer traces widen the output distribution so it covers the correct answer more often, which is really sampling coverage dressed up as thought. Past a threshold the distribution gets too diffuse and accuracy collapses (Does extended thinking actually improve reasoning or just increase variance?). If that's right, a lot of 'thinking' is closer to taking more lottery tickets than to genuine deliberation.
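The variance-inflation story can be illustrated with a small calculation. This is a hedged sketch, not the analysis from the note: it assumes the model's answers are normally distributed around a biased guess, and the names `bias`, `tolerance`, and `spread` are hypothetical. Widening the spread first raises the chance of landing near the correct answer, then lowers it once the distribution is too diffuse.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def hit_probability(spread, bias=1.0, tolerance=0.25):
    """Chance that one sample lands within `tolerance` of the correct answer.

    Assumption: the correct answer sits at 0 and samples follow
    Normal(mean=bias, std=spread), so the model is systematically off by `bias`.
    """
    return (normal_cdf(tolerance, bias, spread)
            - normal_cdf(-tolerance, bias, spread))

if __name__ == "__main__":
    for spread in (0.2, 0.5, 1.0, 2.0, 5.0):
        print(f"spread={spread}: hit probability={hit_probability(spread):.4f}")
```

Nothing about the underlying guess improves as `spread` grows; only the coverage changes. That is the sense in which the gain resembles buying more lottery tickets.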

But quantity isn't the whole story: training mediates *quality*. Vanilla models often use thinking mode counterproductively, talking themselves into self-doubt that degrades performance; RL training turns the very same mechanism into productive gap analysis (Does extended thinking help or hurt model reasoning?). And the capability may already be latent: multiple independent methods all elicit reasoning that's *already present* in base-model activations, suggesting post-training selects reasoning rather than creating it (Do base models already contain hidden reasoning ability?). So 'thinking longer' may matter less than thinking having been *shaped* the right way.

Where extended thinking clearly fails to help is just as revealing. Reasoning models can't reject ill-posed questions with missing premises: they generate long redundant traces while plain models simply say 'unanswerable,' because training rewards producing steps but never teaches *when to disengage* (Why do reasoning models overthink ill-posed questions?). On theory-of-mind tasks, reasoning models produce longer-but-useless traces and don't beat vanilla LLMs, hinting that social reasoning needs a different architecture than sequential derivation (Why do reasoning models struggle with theory of mind tasks?). And more reasoning training does nothing for sycophancy: models still fall for flattering fallacies, because that's a generation-distribution problem, not a reasoning one (Can better reasoning training actually reduce model sycophancy?).

If you want the constructive flip side: you can compress without losing much. Verbose and concise chains occupy distinct, *steerable* regions of activation space: a single extracted vector cut chain-of-thought length 67% while holding accuracy and nearly tripling speed (Can we steer reasoning toward brevity without retraining?). There's even a way to measure whether a model is *genuinely* reasoning rather than just emitting tokens: the deep-thinking ratio tracks how often token predictions get revised across layers, and it correlates with accuracy well enough to cut inference cost (Can we measure how deeply a model actually reasons?). The thing you didn't know you wanted to know: the field is quietly shifting from 'how do we make models think more' to 'how do we tell when thinking is real, and stop paying for the parts that aren't.'
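The steering idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Activation-Steered Compression implementation: it assumes the vector is the difference of mean activations between paired verbose and concise examples, and the helper names (`mean_vector`, `steering_vector`, `steer`) and the `strength` parameter are hypothetical. Activations are plain lists of floats to keep the example dependency-free.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def steering_vector(verbose_acts, concise_acts):
    """Direction pointing from the verbose region toward the concise region.

    Assumption: computed as mean(concise) - mean(verbose) over paired examples.
    """
    mean_verbose = mean_vector(verbose_acts)
    mean_concise = mean_vector(concise_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]

def steer(hidden, direction, strength=1.0):
    """Shift a hidden state along the steering direction at inference time."""
    return [h + strength * d for h, d in zip(hidden, direction)]

if __name__ == "__main__":
    # Toy 2-dimensional activations, made up purely for illustration.
    verbose = [[1.0, 0.0], [3.0, 0.0]]
    concise = [[0.0, 2.0], [0.0, 4.0]]
    direction = steering_vector(verbose, concise)
    print(direction)
    print(steer([1.0, 1.0], direction, strength=0.5))
```

In a real model the activations would come from a chosen layer's residual stream, and `strength` would need tuning so that brevity does not come at the cost of accuracy.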


Sources (11 notes)

Does more thinking time actually improve LLM reasoning?

Accuracy drops from 87.3% to 70.3% as thinking tokens scale from 1,100 to 16,000, and bypassing explicit reasoning entirely matches or beats standard thinking at equal token budgets. The relationship is non-monotonic, not the linear improvement commonly assumed.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Why do reasoning models struggle with theory of mind tasks?

Reasoning models fail to outperform vanilla LLMs on theory of mind tasks, produce longer but unhelpful traces, and show no generalization to similar scenarios. ThoughtTracing's success using shorter Bayesian hypothesis tracking suggests social reasoning demands simultaneous multiple-model maintenance, not sequential derivation.

Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found GPT-4 was erroneously convinced 69% more often by fallacious arguments than by logical ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can we measure how deeply a model actually reasons?

Deep-thinking ratio (DTR) measures the proportion of tokens whose predictions undergo significant revision across model layers, correlating robustly with accuracy across AIME, HMMT, and GPQA benchmarks. Think@n, a test-time strategy using DTR, matches self-consistency performance while reducing inference costs.
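The deep-thinking ratio in the last note can be approximated with a simple proxy. This sketch is a simplification and an assumption, not the published metric: it treats a token as "deep-thinking" when its top-1 prediction only settles to the final answer in the later layers, and it assumes per-layer predictions are already available (for example from a logit-lens style readout). The names `settling_layer` and `depth_fraction` are hypothetical.

```python
def settling_layer(layer_predictions):
    """Earliest layer from which the prediction stays equal to the final one."""
    final = layer_predictions[-1]
    settled = len(layer_predictions) - 1
    # Walk backwards from the last layer until the prediction differs.
    for layer in range(len(layer_predictions) - 1, -1, -1):
        if layer_predictions[layer] != final:
            break
        settled = layer
    return settled

def deep_thinking_ratio(tokens, depth_fraction=0.5):
    """Share of tokens whose prediction settles at or beyond `depth_fraction` of depth.

    `tokens` is a list with one entry per generated token; each entry is the
    list of that token's top-1 predictions at every layer, shallow to deep.
    """
    if not tokens:
        return 0.0
    deep = 0
    for preds in tokens:
        if settling_layer(preds) >= depth_fraction * (len(preds) - 1):
            deep += 1
    return deep / len(tokens)

if __name__ == "__main__":
    # Toy per-layer predictions for three tokens in a 4-layer model.
    toy = [["a", "a", "a", "a"], ["x", "y", "z", "b"], ["c", "q", "c", "c"]]
    print(deep_thinking_ratio(toy))
```

A selection strategy in the spirit of Think@n could rank several sampled traces by a score like this and keep the highest, though how the published method actually scores and selects traces is not specified in the note.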

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability researcher re-testing whether extended thinking actually improves model performance. The question remains open: does longer chain-of-thought (CoT) reasoning genuinely improve reasoning ability, or does it trade off accuracy and latency in ways a curated library has only begun to map?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as a constraint candidate, not settled fact.
- Accuracy follows an inverted-U curve with thinking tokens: one study saw accuracy drop from 87.3% to 70.3% as tokens scaled from ~1,100 to ~16,000 (2025-02, arXiv:2502.07266).
- Extended thinking may inflate output variance rather than improve reasoning quality, widening the distribution so correct answers appear more often—a lottery-ticket effect, not deliberation (2025-02).
- RL training transforms vanilla models' counterproductive self-doubt into productive gap analysis; reasoning capability may already be latent in base models (2025-06, arXiv:2506.04210).
- Reasoning models fail on ill-posed questions and social reasoning tasks where vanilla models abstain or outperform; they generate long traces but don't improve accuracy (2025-02, arXiv:2502.11881).
- Verbose and concise CoT occupy distinct activation regions; chains can compress 67% without accuracy loss (2025-07, arXiv:2507.04742).

Anchor papers (verify; mind their dates):
- arXiv:2502.07266 (When More is Less, Feb 2025)
- arXiv:2506.04210 (Does Thinking More always Help, Jun 2025)
- arXiv:2507.04742 (Activation Steering for Compression, Jul 2025)
- arXiv:2602.13517 (Deep-Thinking Tokens, Feb 2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the inverted-U finding, the variance-inflation hypothesis, and the failure modes (ill-posed, social reasoning), probe whether newer models (o1, o3, Claude, Grok-3) exhibit the same nonlinearity, whether RL at scale dissolves the distinction between variance and true reasoning improvement, and whether recent architectures (retrieval-augmented, multi-agent, tool-integrated) bypass the social-reasoning and premise-checking failures. Separate what still limits extended thinking from what newer training or orchestration has relaxed.

(2) Surface the strongest contradicting or superseding work from the last ~6 months—especially papers claiming linear scaling, architectural fixes for premise validation, or evidence that test-time compute genuinely unlocks new capabilities rather than redistributing existing ones.

(3) Propose two research questions that assume the regime may have moved: (a) If RL has already converged on optimal thinking-token allocation, what prevents scaling to dramatically longer traces? (b) Can you build a zero-cost probe—activation-based, not trace-based—that predicts when extended thinking will hurt rather than help, and use it to gate inference dynamically?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
