INQUIRING LINE

Why do more capable models prefer shorter chains of thought?

This explores why stronger models tend to reason in shorter chains — and whether that brevity is a sign of skill, a quirk of training rewards, or a hint that the visible reasoning was never doing the work we assumed.


The corpus turns this into a more interesting puzzle than a simple "smarter = more efficient" story. The most direct answer is that there is a sweet spot: accuracy follows an inverted-U as reasoning gets longer, peaking at some intermediate length and then declining. That optimal length stretches out for harder tasks but shrinks as the model gets more capable, so a stronger model needs fewer steps to reach its accuracy peak (Why does chain of thought accuracy eventually decline with length?). Crucially, nobody trains the model to be terse: reinforcement learning drifts toward shorter chains on its own as the model improves, so brevity emerges from the reward signal rather than being explicitly taught.
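To make the inverted-U concrete, here is a toy sketch. It is an illustrative assumption, not the model used in the cited note: each step is assumed to fail with a fixed noise floor plus a term that shrinks quadratically as the task is split into more steps, and the chain succeeds only if every step does. The names `accuracy`, `best_length`, `step_noise`, and the functional form are all hypothetical.

```python
def accuracy(n_steps: int, difficulty: float, capability: float,
             step_noise: float = 0.01) -> float:
    """Toy probability that a chain of n_steps reaches the right answer."""
    # Assumed form: a fixed per-step noise floor plus a term that shrinks
    # as the task is split into more (smaller) steps.
    per_step_error = step_noise + (difficulty / (capability * n_steps)) ** 2
    per_step_error = min(1.0, per_step_error)
    # Every step must succeed, so longer chains compound the noise floor.
    return (1.0 - per_step_error) ** n_steps


def best_length(difficulty: float, capability: float,
                max_steps: int = 200) -> int:
    """Chain length in 1..max_steps that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1),
               key=lambda n: accuracy(n, difficulty, capability))


if __name__ == "__main__":
    # Compare a baseline, a harder task, and a more capable model.
    for difficulty, capability in [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0)]:
        n = best_length(difficulty, capability)
        print(difficulty, capability, n,
              round(accuracy(n, difficulty, capability), 3))
```

In this toy, the best length grows with difficulty and shrinks with capability, which matches the qualitative pattern described above. The specific numbers carry no empirical weight.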

What's striking is how much of a long chain turns out to be non-computational. One study strips reasoning down to minimal drafts and matches full chain-of-thought accuracy using just 7.6% of the tokens; the remaining 92.4% served style and documentation, not the actual computation (Can minimal reasoning chains match full explanations?). And verbosity itself seems to be a single steerable direction in the model's activation space: a vector extracted from 50 paired examples cuts chain length by about two-thirds (67%) while maintaining accuracy (Can we steer reasoning toward brevity without retraining?). If conciseness lives on one dial, a capable model converging toward it looks less like a discovery and more like settling into a region it can already represent.
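The "single steerable direction" idea can be sketched with generic difference-of-means activation steering. This is a minimal sketch under stated assumptions, not necessarily the exact procedure of Activation-Steered Compression: activations are plain lists of floats, and `steering_vector`, `steer`, and `strength` are hypothetical names chosen for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(verbose_acts: Sequence[Sequence[float]],
                    concise_acts: Sequence[Sequence[float]]) -> Vector:
    """Direction from verbose toward concise: mean(concise) - mean(verbose)."""
    verbose_mean = mean_vector(verbose_acts)
    concise_mean = mean_vector(concise_acts)
    return [c - v for c, v in zip(concise_mean, verbose_mean)]


def steer(hidden: Sequence[float], direction: Sequence[float],
          strength: float) -> Vector:
    """Shift a hidden state along the direction by the given strength."""
    return [h + strength * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    # Two made-up 2-dimensional activations per style (illustrative data only).
    verbose = [[1.0, 0.0], [3.0, 0.0]]
    concise = [[0.0, 2.0], [0.0, 4.0]]
    direction = steering_vector(verbose, concise)
    print(direction)
    print(steer([1.0, 1.0], direction, strength=0.5))
```

In a real model the activations would come from a chosen layer for paired verbose and concise versions of the same reasoning, and the shift would be applied at inference time. Which layer and what strength to use are tuning choices that this sketch does not address.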

Overthinking is actively harmful, not merely wasteful. Push thinking tokens from ~1,100 to ~16K and benchmark accuracy can collapse from 87% to 70%: models overthink easy problems, and the extra deliberation corrodes answers that were already correct (Does more thinking time always improve reasoning accuracy?). There's a mechanistic flavor to this. Models without RL training use extended thinking counterproductively, talking themselves into self-doubt, and RL training reverses that, turning the same machinery from second-guessing into useful gap analysis (Does extended thinking help or hurt model reasoning?). So a capable model's short chain may reflect that it no longer needs to argue itself out of a corner.

Here's the part you didn't know you wanted to know: trace length may not be measuring difficulty at all. Controlled maze experiments show that chain length tracks problem difficulty only when problems resemble the training data; out of distribution, the correlation vanishes entirely. Length mostly reflects how well the model is recalling a familiar schema, not how much it is adaptively computing (Does longer reasoning actually mean harder problems?). A capable model produces short chains partly because more of the world looks familiar to it. This reframes the whole question, and two further findings sharpen it. Fine-tuning makes reasoning steps less causally connected to the final answer, so the chain becomes performative rather than functional (Does fine-tuning disconnect reasoning steps from final answers?). And models can scale reasoning entirely in latent space without verbalizing anything, suggesting the written-out chain is a training artifact rather than a requirement (Can models reason without generating visible thinking tokens?).
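One way to probe whether a chain is functional or performative is to perturb it and check whether the answer moves. The sketch below illustrates two of the checks named in the source notes, early termination and filler substitution, under loud assumptions: `answer_fn` is a hypothetical callable standing in for a model that maps a question plus reasoning steps to an answer, and the scoring is a simplification rather than the cited paper's protocol.

```python
from typing import Callable, List

# Hypothetical interface: (question, reasoning steps) -> final answer.
AnswerFn = Callable[[str, List[str]], str]


def early_termination_invariance(answer_fn: AnswerFn, question: str,
                                 steps: List[str]) -> float:
    """Fraction of truncated chains that still yield the full-chain answer."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(question, steps)
    # Every strict prefix of the chain, including the empty chain.
    prefixes = [steps[:k] for k in range(len(steps))]
    unchanged = sum(answer_fn(question, p) == full_answer for p in prefixes)
    return unchanged / len(prefixes)


def filler_invariant(answer_fn: AnswerFn, question: str,
                     steps: List[str], filler: str = "...") -> bool:
    """True if replacing every step with filler leaves the answer unchanged."""
    full_answer = answer_fn(question, steps)
    return answer_fn(question, [filler] * len(steps)) == full_answer


if __name__ == "__main__":
    # Two stub "models": one ignores its chain, one reads the last step.
    def ignores_chain(question: str, steps: List[str]) -> str:
        return "42"

    def uses_chain(question: str, steps: List[str]) -> str:
        return steps[-1] if steps else "unknown"

    chain = ["6 * 7", "= 42", "42"]
    for model in (ignores_chain, uses_chain):
        print(model.__name__,
              early_termination_invariance(model, "6 * 7?", chain),
              filler_invariant(model, "6 * 7?", chain))
```

High invariance means the answer barely depends on the visible chain, which is the pattern the notes report becoming more common after fine-tuning.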

For the adjacent territory: rather than always reasoning short or long, models can learn to route, picking extended thinking or a direct answer per problem without difficulty labels (Can models learn when to think versus respond quickly?). And brevity isn't free everywhere. Longer chains do measurably dampen sensitivity to noisy inputs, though never to zero (Can longer reasoning chains eliminate model sensitivity to input noise?), while in multimodal perception tasks verbose reasoning actively hurts because it optimizes the wrong bottleneck (Does verbose chain-of-thought actually help multimodal perception tasks?). Taken together, the corpus suggests that "more capable models prefer shorter chains" is real but its cause is layered: part genuine efficiency, part reward-driven convergence, and part a clue that chain length was telling us about familiarity and presentation more than about thinking itself.


Sources (11 notes)

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does extended thinking help or hurt model reasoning?

Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can models learn when to think versus respond quickly?

Thinkless trains a single model to select between extended reasoning and direct responses using DeGRPO, which decouples mode selection from answer refinement. This prevents mode collapse and enables self-calibrated routing without explicit difficulty labels.

Can longer reasoning chains eliminate model sensitivity to input noise?

Lipschitz continuity analysis proves that while additional reasoning steps reduce perturbation propagation, a non-zero robustness floor exists structurally. Sensitivity decreases with stronger embedding and hidden state norms but never reaches zero.

Does verbose chain-of-thought actually help multimodal perception tasks?

Long rationales and text-token RL help reasoning but hurt fine-grained perception tasks because the actual bottleneck is visual attention allocation, not verbalization. Standard CoT optimization trains the wrong policy target.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about chain-of-thought reasoning in LLMs. The question: Why do more capable models prefer shorter chains of thought — and is this preference robust across domains, training regimes, and model architectures?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2025.
• Accuracy follows an inverted-U with chain length; optimal length shrinks as model capability grows, and RL training drifts toward brevity without explicit pressure (2024–2025).
• 92% of tokens in verbose chains serve style, not computation; conciseness occupies a single steerable activation direction (2024–2025).
• Overthinking degrades accuracy: pushing reasoning tokens from ~1.1K to ~16K collapses performance from 87% to 70% on benchmarks (2025).
• Chain length correlates with problem difficulty only in-distribution; out-of-distribution, the link vanishes — length mostly reflects schema recall, not adaptive computation (2025).
• Fine-tuning weakens causal connection between reasoning steps and answers; latent reasoning scales without verbalization, suggesting written chains are training artifacts (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2406.06580 (2024): Break the Chain — shortcut reasoning
• arXiv:2411.15382 (2024): Fine-Tuning Impact on CoT Reasoning
• arXiv:2508.01191 (2025): Chain-of-Thought as Data Distribution Mirage
• arXiv:2509.07339 (2025): Performative Thinking — brittle CoT-complexity link

Your task:
(1) RE-TEST the distribution-dependency claim and the inverted-U constraint. Have new evaluations (especially on harder, out-of-distribution benchmarks or newer model families) confirmed that brevity preference holds only in-domain? Does it persist across multimodal or long-context settings? Separate the durable finding (capable models *can* reason briefly) from the perishable claim (they *always prefer* to).
(2) Surface the strongest CONTRADICTING work from the last 6 months — any paper showing extended thinking *always* helps, or that chain length is a poor proxy for model capability, or that fine-tuning *restores* faithfulness.
(3) Propose 2 research questions that assume the regime may have shifted: (a) If latent reasoning now scales test-time compute without verbalization, does the preference for short *written* chains become irrelevant to understanding model reasoning? (b) Can adaptive routing (picking chain length per problem) now outperform fixed-length strategies across diverse domains?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
