SYNTHESIS NOTE

Why do correct reasoning traces contain fewer tokens?

In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

A counterintuitive empirical finding: when comparing correct vs. incorrect solutions to the same questions in o1-like models (QwQ, DeepSeek-R1, LIMO), the correct solutions are systematically shorter. More tokens correlate with wrongness, not rightness.
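The comparison described above is paired per question, which matters because harder questions tend to draw longer traces regardless of correctness. A minimal sketch of that within-question comparison follows. The input layout (`question_id`, `n_tokens`, `is_correct`) and both function names are hypothetical and are not taken from the cited work.

```python
from collections import defaultdict
from statistics import mean


def length_gap_by_question(samples):
    """Return {question_id: mean incorrect length - mean correct length}.

    `samples` is an iterable of (question_id, n_tokens, is_correct) tuples.
    This layout is an assumption made for illustration only.
    Questions lacking either a correct or an incorrect sample are skipped,
    since no within-question comparison is possible for them.
    """
    by_question = defaultdict(lambda: {True: [], False: []})
    for question_id, n_tokens, is_correct in samples:
        by_question[question_id][bool(is_correct)].append(n_tokens)

    gaps = {}
    for question_id, groups in by_question.items():
        if groups[True] and groups[False]:
            gaps[question_id] = mean(groups[False]) - mean(groups[True])
    return gaps


def fraction_correct_shorter(gaps):
    """Share of comparable questions where correct traces are shorter on average."""
    if not gaps:
        return 0.0
    return sum(1 for gap in gaps.values() if gap > 0) / len(gaps)
```

A positive gap for a question means its incorrect samples were longer on average. Restricting the comparison to questions with both kinds of sample keeps question difficulty from masquerading as a length effect.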

This directly challenges the "longer = better" narrative underlying much of the test-time scaling literature. If scaling compute leads to longer traces, and longer traces are more likely to be incorrect, then compute scaling via trace extension is actively selecting for worse outputs.

The explanation: longer CoTs contain more self-revisions (see Does self-revision actually improve reasoning in language models?). The model overshoots, revises, introduces errors, and compounds them through revision chains. A model that gets to the right answer quickly does so because it's reasoning correctly, not because it failed to second-guess itself.
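One rough way to probe the self-revision explanation is to count revision markers in a trace and compare the counts with trace length and correctness. The sketch below assumes that markers such as "Wait" and "Alternatively" (the tokens named in the related self-revision note) are a usable proxy for a revision step. The marker list and the function name are illustrative choices, not a method from the cited papers.

```python
import re

# Hypothetical marker list; a real analysis would need a validated set.
REVISION_MARKERS = ("wait", "alternatively")

# Match whole words only, case-insensitively, so "Waiting" is not counted.
_MARKER_PATTERN = re.compile(
    r"\b(" + "|".join(REVISION_MARKERS) + r")\b", re.IGNORECASE
)


def count_revisions(trace: str) -> int:
    """Count occurrences of revision markers in a reasoning trace."""
    return len(_MARKER_PATTERN.findall(trace))
```

Keyword counting is crude: it misses revisions phrased differently and counts markers used in ordinary prose, so it is at best a first-pass signal.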

The practical implication is that trace length is a poor quality signal, and that training and inference strategies that optimize for longer traces may be optimizing in the wrong direction.

The LLM Strategic Reasoning paper (a behavioral game theory evaluation of 22 LLMs) provides independent cross-domain confirmation. In competitive games, top performers (GPT-o1, DeepSeek-R1) produce their shortest CoT in the games where they are strongest. DeepSeek-R1 in competitive games exhibits "repeated self-doubt in its CoT" that creates redundant reasoning loops, inflating token usage without improvement. The pattern extends beyond math and coding to strategic interaction: across game types, longer chains signal hesitation and uncertainty, not deeper insight. See the note Do large language models use one reasoning style or many?

GaslightingBench-R adds a further dimension: manipulative multi-turn prompts exploit exactly this vulnerability. By introducing misleading content into the chain, adversarial prompts extend the reasoning trace through corrupted steps. The model's own reasoning then elaborates those corrupted steps into longer wrong answers. The same length-wrongness correlation holds, but now as a designed attack surface: longer chains are more exposed to manipulation because they offer more points of intervention. The note Why do reasoning models fail under manipulative prompts? documents this adversarial dimension.

Inquiring lines that read this note

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do correct reasoning traces tend to be shorter than incorrect ones?

Do corrupted reasoning traces serve as effective supervision signals?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

When do additional thinking tokens stop improving reasoning performance?

What makes thinking tokens carry more information than other tokens?

Related concepts in this collection 9


Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
the mechanism behind this finding
Does more thinking time always improve reasoning accuracy? Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
the broader overthinking phenomenon
Do hedging markers actually signal careful thinking in AI? Explores whether linguistic markers like "alternatively" and "however" in model outputs correlate with accuracy or uncertainty. This matters because users often interpret such language as a sign of trustworthy reasoning.
supporting evidence from linguistic analysis
Why do reasoning models fail under manipulative prompts? Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
adversarial exploitation of the length-wrongness correlation
Do large language models use one reasoning style or many? Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
independent confirmation from game theory: leaders produce shortest CoT in their strongest games
Why do reasoning models struggle with theory of mind tasks? Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
the inverse-length pattern breaks in social reasoning
Can we measure how deeply a model actually reasons? What if reasoning quality isn't about length or confidence, but about how much a model's predictions shift across its internal layers? Can tracking these shifts reveal genuine thinking versus pattern-matching?
DTR explains the mechanism: correct traces have a higher proportion of deep-thinking tokens (genuine computation) and less low-DTR padding. Separately, models produce longer traces for ToM than for factual questions, yet effort is uncorrelated with accuracy, suggesting the shorter=correct heuristic is domain-specific to formal reasoning
Does longer reasoning actually mean harder problems? Do chain-of-thought trace lengths reliably reflect problem difficulty, or do they primarily indicate proximity to training examples? Understanding this matters for designing effective scaling heuristics.
refines the mechanism further: trace length reflects how close the prompt is to the training distribution, not how hard the problem is; "correct = shorter" partly recodes "in-distribution = shorter = more likely correct", so the length signal is a training-proximity signal, not purely a reasoning-quality signal
Does every correct chain-of-thought trace improve fine-tuning? Are all answer-correct reasoning traces equally valuable for training? This explores whether some correct traces contain reasoning that actually harms model learning despite reaching the right answer.
identifies the causal lever behind the shorter=correct correlation: the harmful segment is the post-conclusion continuation, and deleting it at training time improves SFT, so the lever is that segment rather than length per se


Original note title

correct reasoning traces in o1-like models are shorter than incorrect ones
