INQUIRING LINE

Why do correct reasoning traces in language models tend to be shorter?

This explores why, in reasoning models, the correct answer usually arrives in fewer tokens — and what that says about whether longer 'thinking' actually means better thinking.


The corpus points to a counterintuitive answer: extra length is often a symptom of trouble, not a sign of effort. Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, and those revisions tend to introduce and compound errors rather than fix them (Why do correct reasoning traces contain fewer tokens?). On this reading, the extra length reflects a model that is struggling, not one that is reasoning its way toward the answer. When a problem is genuinely within reach, the model tends to land it quickly.
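One way to check this pattern on your own samples is to compare lengths within each problem, since harder problems tend to produce both longer traces and more errors. The sketch below is illustrative only: the function name, the `(problem_id, num_tokens, is_correct)` record shape, and the toy numbers are assumptions, not data from the studies cited above.

```python
from collections import defaultdict
from statistics import mean


def per_problem_length_gap(samples):
    """Return {problem_id: mean incorrect length - mean correct length}.

    `samples` is an iterable of (problem_id, num_tokens, is_correct).
    Problems that lack either a correct or an incorrect sample are skipped,
    because no within-problem comparison is possible for them.
    """
    by_problem = defaultdict(lambda: {True: [], False: []})
    for problem_id, num_tokens, is_correct in samples:
        by_problem[problem_id][bool(is_correct)].append(num_tokens)

    gaps = {}
    for problem_id, groups in by_problem.items():
        if groups[True] and groups[False]:
            gaps[problem_id] = mean(groups[False]) - mean(groups[True])
    return gaps


# Hypothetical toy samples (several sampled traces per problem).
toy_samples = [
    ("p1", 300, True), ("p1", 500, True), ("p1", 900, False),
    ("p2", 1200, True), ("p2", 2000, False), ("p2", 1600, False),
    ("p3", 250, True),  # no incorrect sample, so p3 is skipped
]
gaps = per_problem_length_gap(toy_samples)
```

A positive gap for a problem means its incorrect traces ran longer than its correct ones. Pooling all traces without grouping would mix this effect with problem difficulty.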

A wider view of chain-of-thought sharpens this. Optimal CoT length follows an inverted-U: accuracy peaks at some intermediate length and then declines. The optimal length grows with task difficulty but shrinks with model capability, so more capable models prefer *shorter* chains. RL training drifts toward brevity as models improve; simplicity emerges from the reward signal, not from being told to be concise (Why does chain of thought accuracy eventually decline with length?). So 'shorter when correct' isn't a quirk of one model family; it is a general pattern that tracks competence.
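The inverted-U can be looked for empirically by binning traces by length and computing accuracy per bin. This is a minimal sketch under stated assumptions: the function names, the fixed-width binning, and the toy data are hypothetical and are not the procedure used in the cited work.

```python
from collections import defaultdict


def accuracy_by_length_bin(samples, bin_width):
    """Group (num_tokens, is_correct) samples into fixed-width length bins.

    Returns a list of (bin_start, accuracy, count), sorted by bin_start.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for num_tokens, is_correct in samples:
        bin_start = (num_tokens // bin_width) * bin_width
        totals[bin_start] += 1
        if is_correct:
            hits[bin_start] += 1
    return [(b, hits[b] / totals[b], totals[b]) for b in sorted(totals)]


def peak_bin(curve):
    """Return the bin_start with the highest accuracy (first one on ties)."""
    return max(curve, key=lambda row: row[1])[0]


# Hypothetical toy data: too-short and too-long traces both fail more often.
toy = [
    (80, False), (120, False), (150, True),
    (260, True), (280, True), (310, True), (390, False),
    (450, False), (470, True),
    (620, False), (680, False),
]
curve = accuracy_by_length_bin(toy, bin_width=200)
best = peak_bin(curve)
```

With real data the per-bin counts matter: a bin with very few samples can look like a peak by chance, which is why the count is returned alongside the accuracy.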

The deeper reason is that most of the tokens in a verbose trace aren't doing computational work. Chain of Draft matches full chain-of-thought accuracy on arithmetic, symbolic, and commonsense tasks using just 7.6% of the tokens; the other 92.4% served style and documentation, not the actual reasoning (Can minimal reasoning chains match full explanations?). When tokens are ranked by functional importance, the symbolic-computation tokens are preserved while grammar and meta-discourse are pruned first, and student models trained on the pruned chains do *better* than those trained on chains compressed by a frontier model (Which tokens in reasoning chains actually matter most?). If the load-bearing reasoning is a small fraction of the text, then a long trace is mostly padding, and padding is where errors hide.
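The idea of greedy pruning can be illustrated with a toy: repeatedly drop whichever token costs the least under some scoring function until a token budget is met. This is not the cited method. In the real setting the score would come from a model's likelihood, whereas `toy_score` below is a hypothetical stand-in that simply rewards digits and operators, and all names here are assumptions.

```python
def greedy_prune(tokens, score, keep):
    """Greedily remove tokens until only `keep` remain.

    At each step, remove the token whose removal leaves the highest score,
    i.e. the token that contributes least under `score`.
    """
    kept = list(tokens)
    while len(kept) > keep:
        drop_index = max(
            range(len(kept)),
            key=lambda i: score(kept[:i] + kept[i + 1:]),
        )
        del kept[drop_index]
    return kept


def toy_score(tokens):
    """Hypothetical stand-in for answer likelihood (an assumption).

    Tokens containing a digit or an arithmetic symbol count heavily;
    everything else counts only slightly.
    """
    def is_symbolic(token):
        return any(ch.isdigit() or ch in "+-*/=" for ch in token)

    return sum(3.0 if is_symbolic(t) else 0.1 for t in tokens)


chain = "so first we add 12 + 30 = 42 and then , clearly , 42 * 2 = 84".split()
pruned = greedy_prune(chain, toy_score, keep=10)
kept_fraction = len(pruned) / len(chain)
```

Under this toy scorer the connective and meta-discourse words go first and the arithmetic survives, which mirrors the qualitative finding above without reproducing it.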

There's also a darker reading lurking in the corpus: that the traces aren't really the reasoning at all. Models trained on systematically irrelevant or corrupted traces perform about as well as those trained on correct ones, and invalid logical steps perform nearly as well as valid ones, suggesting traces work as computational scaffolding and stylistic mimicry rather than verified thought (Do reasoning traces need to be semantically correct?; Do reasoning traces show how models actually think?). If the visible trace is partly performance, then more of it just means more surface area for the model to wander, second-guess, and drift away from the answer it may have already computed internally (Do transformers hide reasoning before producing filler tokens?).

Finally, length itself hurts. In FLenQA, reasoning accuracy drops from 92% to 68% with just 3,000 tokens of input padding, far below the context window; the degradation is task-agnostic and persists even with CoT prompting (Does reasoning ability actually degrade with longer inputs?). And local memorization, based on immediately preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and under distribution shift (Where do memorization errors arise in chain-of-thought reasoning?). Put together, the picture isn't 'short causes correct' so much as a feedback loop: the model that's on track finishes fast, while the model that's lost keeps talking, and every extra token is another chance to talk itself out of the right answer.


Sources (9 notes)

Why do correct reasoning traces contain fewer tokens?

Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, assess whether the constraint 'correct reasoning traces are shorter' still holds, or whether newer models, training methods, or evaluation have relaxed it.

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Apr 2026.
• Correct reasoning traces in QwQ, DeepSeek-R1, LIMO use fewer tokens than incorrect ones; length correlates with self-revision cycles that compound rather than fix errors (~2025).
• Optimal CoT length follows an inverted-U; accuracy peaks at intermediate length then declines; more capable models prefer *shorter* chains (~2025).
• Chain of Draft matches full CoT accuracy using 7.6% of tokens; 92% of verbose traces is style/documentation, not computation (~2025).
• Deliberately corrupted traces teach as well as correct ones; invalid logical steps perform nearly as well as valid ones, suggesting traces are scaffolding/mimicry, not verified thought (~2025).
• Reasoning accuracy drops from 92% to 68% with just 3,000 tokens of input padding (~2024); local memorization (preceding tokens) accounts for up to 67% of reasoning errors, worsening with complexity and distribution shift (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2502.07266 (Feb 2025) — When More is Less: Understanding Chain-of-Thought Length in LLMs
• arXiv:2508.02037 (Aug 2025) — Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
• arXiv:2604.15726 (Apr 2026) — LLM Reasoning Is Latent, Not the Chain of Thought
• arXiv:2508.01191 (Aug 2025) — Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, judge whether newer models (o1, QwQ-32B, Claude 4, Gemini 2), test-time scaling (compute-optimal inference), RL-refined trace synthesis, or mechanistic evaluation (circuit analysis, causal intervention) have since RELAXED or OVERTURNED it. Separate the durable question ('why does verbosity risk error?') from the perishable claim ('short traces are always correct'). Cite what resolved each constraint, and flag which still appear to hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — papers claiming long traces *do* improve reasoning, or that trace length correlates differently under adversarial setups, distribution shift, or new training regimes.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., 'Under what *inference cost budget* do longer test-time traces become adaptive?' or 'Do traces trained on synthetic long-horizon reasoning (proof search, tree search) escape the brevity penalty?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
