Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
A counterintuitive empirical finding: when correct and incorrect solutions to the same questions are compared in o1-like models (QwQ, DeepSeek-R1, LIMO), the correct solutions are systematically shorter. Longer traces correlate with wrong answers, not right ones.
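The comparison described above can be sketched in a few lines. This is a minimal illustration, not the method of any cited paper: the function names, the `(question_id, n_tokens, is_correct)` record layout, and the toy numbers are all assumptions made for the example. It groups sampled solutions by question, keeps only questions that have at least one correct and one incorrect sample, and reports how often the incorrect samples are longer on average.

```python
from collections import defaultdict
from statistics import mean


def length_gaps(samples):
    """Return {question_id: mean incorrect length - mean correct length}.

    `samples` is an iterable of (question_id, n_tokens, is_correct) tuples.
    This record layout is a hypothetical one chosen for the sketch.
    """
    by_question = defaultdict(lambda: {True: [], False: []})
    for question_id, n_tokens, is_correct in samples:
        by_question[question_id][bool(is_correct)].append(n_tokens)

    gaps = {}
    for question_id, groups in by_question.items():
        # A question is comparable only if it has both kinds of sample.
        if groups[True] and groups[False]:
            gaps[question_id] = mean(groups[False]) - mean(groups[True])
    return gaps


def fraction_correct_shorter(gaps):
    """Share of comparable questions where correct samples are shorter on average."""
    if not gaps:
        return 0.0
    return sum(1 for gap in gaps.values() if gap > 0) / len(gaps)


# Made-up toy data, for illustration only (not results from any model).
toy_samples = [
    ("q1", 120, True),
    ("q1", 150, True),
    ("q1", 340, False),
    ("q2", 200, True),
    ("q2", 180, False),
    ("q3", 90, True),  # no incorrect sample, so q3 is skipped
]

toy_gaps = length_gaps(toy_samples)
print(toy_gaps, fraction_correct_shorter(toy_gaps))
```

Comparing within each question, rather than pooling all samples, keeps harder questions (which may draw longer traces regardless of correctness) from dominating the result.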
This directly challenges the "longer = better" narrative underlying much of the test-time scaling literature. If scaling compute leads to longer traces, and longer traces are more likely to be incorrect, then compute scaling via trace extension is actively selecting for worse outputs.
The explanation: longer chains of thought (CoTs) contain more self-revisions (see Does self-revision actually improve reasoning in language models?). The model overshoots, revises, introduces errors, and compounds them through revision chains. A model that gets to the right answer quickly does so because it is reasoning correctly, not because it failed to second-guess itself.
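One rough way to look at the self-revision claim is to count revision markers in a trace. The sketch below is an assumption-laden illustration: the marker list contains only the two examples named elsewhere in this note ('Wait' and 'Alternatively'), the function names are hypothetical, and whitespace-separated words stand in for real tokenizer tokens.

```python
import re

# Illustrative marker list taken from the examples in this note; not exhaustive.
REVISION_MARKERS = ("wait", "alternatively")

# Match whole words only, case-insensitively, so "waiter" does not count.
_MARKER_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in REVISION_MARKERS) + r")\b",
    re.IGNORECASE,
)


def count_revisions(trace: str) -> int:
    """Count occurrences of revision markers in a reasoning trace."""
    return len(_MARKER_PATTERN.findall(trace))


def revisions_per_100_words(trace: str) -> float:
    """Revision markers per 100 whitespace-separated words (a crude length proxy)."""
    n_words = len(trace.split())
    if n_words == 0:
        return 0.0
    return 100.0 * count_revisions(trace) / n_words


# Made-up example trace, for illustration only.
example = "So x = 3. Wait, that fails the check. Alternatively, try x = 2."
print(count_revisions(example), revisions_per_100_words(example))
```

Normalizing by length matters here: a raw count rises with trace length by construction, so a per-100-words rate is the fairer thing to compare between correct and incorrect traces.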
The practical implication is that trace length is a poor quality signal, and that training and inference strategies optimizing for longer traces may be optimizing in the wrong direction.
The LLM Strategic Reasoning paper (a behavioral game theory evaluation of 22 LLMs) provides independent cross-domain confirmation. In competitive games, the top performers (GPT-o1, DeepSeek-R1) produce the shortest CoT in their strongest games. In competitive games, DeepSeek-R1 exhibits "repeated self-doubt in its CoT", which creates redundant reasoning loops that inflate token usage without improving results. The pattern extends beyond math and coding to strategic interaction: across game types, longer chains signal hesitation and uncertainty, not deeper insight. See Do large language models use one reasoning style or many?
GaslightingBench-R adds a further dimension: manipulative multi-turn prompts exploit exactly this vulnerability. By introducing misleading content into the chain, adversarial prompts extend the reasoning trace through corrupted steps. The model's own reasoning then elaborates those corrupted steps into longer wrong answers. The same length-wrongness correlation holds, but now as a designed attack surface: longer chains are more exposed to manipulation because they offer more points of intervention. The note Why do reasoning models fail under manipulative prompts? documents this adversarial dimension.
Inquiring lines that use this note as a source (25)
This note is a source for these synthesized inquiries.
- Are correct reasoning traces measurably shorter than incorrect ones?
- Why do top performers produce shorter chains of thought in their strongest domains?
- What linguistic markers distinguish longer incorrect traces from correct ones?
- Why do correct reasoning traces in language models tend to be shorter?
- Why do correct reasoning traces appear shorter than incorrect ones?
- Why do shorter correct reasoning traces contain fewer failed branches?
- Why are correct reasoning traces consistently shorter than incorrect ones?
- Do shorter reasoning traces actually produce more reliable model outputs?
- Why do correct reasoning traces tend to be shorter than incorrect ones?
- Why do corrupted traces maintain performance as well as correct traces?
- Which sentences in reasoning traces actually influence the final answer?
- How does trace coherence differ from valid mathematical proof in practice?
- How does trace coherence differ from trace validity in reasoning?
- Do correct reasoning traces tend to be shorter than incorrect ones?
- Do shorter correct reasoning traces contain more thought anchors than longer ones?
- Why does failed step fraction predict reasoning quality better than trace length?
- Why do correct reasoning traces stay shorter than incorrect ones?
- Why are incorrect reasoning traces longer than correct ones?
- Why does concise reasoning maintain accuracy with far fewer tokens?
- Does trace length actually reflect problem difficulty or training proximity?
- How much of a reasoning trace is actually redundant or unnecessary?
- What makes thinking tokens carry more information than other tokens?
- What makes a thinking trace take information shortcuts?
- What makes reasoning traces effective or ineffective for solving problems?
- Why are shorter reasoning traces more reliable than longer correct ones?
Related concepts in this collection (9)
- Does self-revision actually improve reasoning in language models?
When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
the mechanism behind this finding
- Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
the broader overthinking phenomenon
- Do hedging markers actually signal careful thinking in AI?
Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
supporting evidence from linguistic analysis
- Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
adversarial exploitation of the length-wrongness correlation
- Do large language models use one reasoning style or many?
Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
independent confirmation from game theory: top performers produce the shortest CoT in their strongest games
- Why do reasoning models struggle with theory of mind tasks?
Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
the inverse-length pattern breaks in social reasoning
- Can we measure how deeply a model actually reasons?
What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
DTR explains the mechanism: correct traces have a higher proportion of deep-thinking tokens (genuine computation) and less low-DTR padding. Separately, models produce longer traces for theory-of-mind (ToM) questions than for factual questions, yet effort is uncorrelated with accuracy, suggesting the shorter=correct heuristic is specific to formal reasoning domains
- Does longer reasoning actually mean harder problems?
Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.
refines the mechanism further: trace length reflects how close the prompt is to the training distribution, not how hard the problem is; "correct = shorter" partly re-encodes "in-distribution = shorter = more likely correct", so the length signal is a training-proximity signal, not purely a reasoning-quality signal
- Does every correct chain-of-thought trace improve fine-tuning?
Are all answer-correct reasoning traces equally valuable for training? This explores whether some correct traces contain reasoning that actually harms model learning despite reaching the right answer.
identifies the causal lever behind the shorter=correct correlation: the harmful segment is the post-conclusion continuation, not length per se, and deleting it at training time improves SFT
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces
- Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces!
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Do Cognitively Interpretable Reasoning Traces Improve LLM Performance?
Original note title
correct reasoning traces in o1-like models are shorter than incorrect ones