SYNTHESIS NOTE

Why do correct reasoning traces contain fewer tokens?

In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

A counterintuitive empirical finding: when comparing correct vs. incorrect solutions to the same questions in o1-like models (QwQ, DeepSeek-R1, LIMO), the correct solutions are systematically shorter. More tokens correlate with wrongness, not rightness.
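The within-question comparison described above can be sketched as follows. This is a minimal illustration, not the method of the source papers: the record layout, the toy numbers, and the function names `length_gap_by_question` and `fraction_correct_shorter` are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (question_id, token_count, is_correct).
# The numbers are made up for illustration and are not from any paper.
SAMPLES = [
    ("q1", 420, True), ("q1", 510, True), ("q1", 1340, False),
    ("q2", 800, True), ("q2", 2100, False), ("q2", 1900, False),
    ("q3", 300, False), ("q3", 950, True),
    ("q4", 600, True), ("q4", 650, True),  # no incorrect sample, so skipped
]


def length_gap_by_question(samples):
    """Return {question_id: mean incorrect length - mean correct length}.

    Only questions with at least one correct and one incorrect sample are
    kept, so question difficulty is held fixed within each comparison.
    """
    by_question = defaultdict(lambda: {True: [], False: []})
    for question_id, token_count, is_correct in samples:
        by_question[question_id][is_correct].append(token_count)

    gaps = {}
    for question_id, groups in by_question.items():
        if groups[True] and groups[False]:
            gaps[question_id] = mean(groups[False]) - mean(groups[True])
    return gaps


def fraction_correct_shorter(gaps):
    """Share of comparable questions where correct traces are shorter on average."""
    return sum(gap > 0 for gap in gaps.values()) / len(gaps)


gaps = length_gap_by_question(SAMPLES)
share = fraction_correct_shorter(gaps)
print(gaps, share)
```

Grouping by question before comparing matters: pooling all samples would mix the length effect with question difficulty, since harder questions tend to produce both longer traces and more errors.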

This directly challenges the "longer = better" narrative underlying much of the test-time scaling literature. If scaling compute mainly produces longer traces, and longer traces are more likely to be incorrect, then scaling compute by extending traces risks selecting for worse outputs.

The proposed explanation is that longer CoTs contain more self-revisions (see Does self-revision actually improve reasoning in language models?). The model overshoots, revises, introduces errors, and compounds them through revision chains. A model that reaches the right answer quickly does so because it is reasoning correctly, not because it failed to second-guess itself.

The practical implication is that trace length is a poor quality signal, and that training or inference strategies that optimize for longer traces may be optimizing in the wrong direction.
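One way to check what length signals on one's own samples is to compare two naive selection policies: always keep the shortest trace per question, or always keep the longest. The sketch below assumes a hypothetical data layout and made-up numbers. The helper `accuracy_of_policy` is illustrative only and is not a method taken from the sources.

```python
# Hypothetical toy data: question_id -> list of (token_count, is_correct).
# Values are invented for illustration only.
TRACES = {
    "q1": [(420, True), (510, True), (1340, False)],
    "q2": [(800, True), (2100, False), (1900, False)],
    "q3": [(300, False), (950, True), (1200, True)],
    "q4": [(600, True), (2400, False)],
}


def accuracy_of_policy(traces, pick):
    """Accuracy when `pick` (min or max) selects one trace per question by length."""
    hits = [
        pick(samples, key=lambda sample: sample[0])[1]
        for samples in traces.values()
    ]
    return sum(hits) / len(hits)


shortest_acc = accuracy_of_policy(TRACES, min)  # keep the shortest trace
longest_acc = accuracy_of_policy(TRACES, max)   # keep the longest trace
print(f"shortest: {shortest_acc:.2f}, longest: {longest_acc:.2f}")
```

On this toy data the shortest-trace policy wins, but q3 shows that it still fails on some questions. That fits the note's framing: length correlates with wrongness on average without being a reliable per-sample indicator of quality.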

The LLM Strategic Reasoning paper (a behavioral game theory evaluation of 22 LLMs) provides independent cross-domain confirmation. In competitive games, top performers (GPT-o1, DeepSeek-R1) produce the shortest CoT within their strongest games. DeepSeek-R1 in competitive games exhibits "repeated self-doubt in its CoT", which creates redundant reasoning loops that inflate token usage without improvement. The pattern extends beyond math and coding to strategic interaction: across game types, longer chains signal hesitation and uncertainty, not deeper insight. See Do large language models use one reasoning style or many?.

GaslightingBench-R adds a further dimension: manipulative multi-turn prompts exploit exactly this vulnerability. By introducing misleading content into the chain, adversarial prompts extend the reasoning trace through corrupted steps. The model's own reasoning then elaborates those corrupted steps into longer wrong answers. The same length-wrongness correlation holds, but now as a designed attack surface: longer chains are more exposed to manipulation because there are more points of intervention. See Why do reasoning models fail under manipulative prompts? for this adversarial dimension.


Original note title

correct reasoning traces in o1-like models are shorter than incorrect ones