Do correct reasoning traces tend to be shorter than incorrect ones?
This explores whether the length of a model's reasoning trace predicts whether it got the answer right — and what that correlation actually tells us about how reasoning models work.
The short answer from the corpus is yes, with a caveat. Across o1-style models such as QwQ, DeepSeek-R1, and LIMO, correct solutions tend to use *fewer* tokens than incorrect ones (see: Why do correct reasoning traces contain fewer tokens?). The interesting part is not the correlation but why it exists. Longer traces tend to pile on self-revisions, and each revision is a fresh chance to introduce an error that then compounds. So length is often not a symptom of a hard problem being worked through; it is a symptom of the model thrashing.
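To make the length gap concrete, here is a minimal sketch of how one might measure it on a set of graded traces. The records are hypothetical, illustrative values, not data from the cited notes, and `mean_length_by_outcome` is a name invented for this example.

```python
from statistics import mean

# Hypothetical records: (token_count, is_correct). Illustrative values only.
traces = [
    (420, True), (380, True), (510, True), (610, True),
    (900, False), (1250, False), (1100, False),
]

def mean_length_by_outcome(records):
    """Return (mean tokens of correct traces, mean tokens of incorrect traces)."""
    correct = [tokens for tokens, ok in records if ok]
    incorrect = [tokens for tokens, ok in records if not ok]
    return mean(correct), mean(incorrect)

correct_mean, incorrect_mean = mean_length_by_outcome(traces)
print(f"correct: {correct_mean:.0f} tokens, incorrect: {incorrect_mean:.0f} tokens")
```

A gap like this only shows that length and correctness travel together in the sample; it says nothing yet about why.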
That reframing connects to a finding that complicates the naive 'more thinking = better' intuition. Optimal chain-of-thought length follows an inverted-U: accuracy climbs with more reasoning up to a point, then declines as chains get too long (see: Why does chain of thought accuracy eventually decline with length?). The optimum moves out for harder tasks, yet more capable models prefer *shorter* chains, and RL training naturally drifts toward brevity as models improve; simplicity emerges from the reward signal rather than being trained in directly. So shortness isn't just correlated with correctness; for stronger models it's a signature of competence.
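One way to look for the inverted-U in practice is to bin traces by length and compute accuracy per bin. The sketch below assumes graded traces are available as (token count, correct?) pairs; the data and the 500-token bin width are made up for illustration, and `accuracy_by_length_bin` is a hypothetical helper.

```python
from collections import defaultdict

def accuracy_by_length_bin(records, bin_width):
    """Group (token_count, is_correct) records into length bins.

    Returns {bin_start: accuracy}, ordered by bin_start.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for tokens, ok in records:
        start = (tokens // bin_width) * bin_width
        totals[start] += 1
        hits[start] += int(ok)
    return {start: hits[start] / totals[start] for start in sorted(totals)}

# Hypothetical graded traces, chosen so accuracy rises and then falls with length.
records = [
    (100, False), (150, True),
    (600, True), (700, True),
    (1200, True), (1300, False),
    (1700, False), (1900, False),
]

curve = accuracy_by_length_bin(records, bin_width=500)
best_bin = max(curve, key=curve.get)  # start of the bin with the highest accuracy
print(curve, best_bin)
```

With real data, the peak bin would be expected to differ by task difficulty and by model, so a single global cutoff is unlikely to fit every setting.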
But here's the thing the question doesn't anticipate: trace length may not measure difficulty at all. Controlled maze experiments show that length correlates with problem difficulty only when problems resemble the training distribution; out-of-distribution, the relationship decouples entirely (see: Does longer reasoning actually mean harder problems?). Length mostly reflects how well the model can recall a familiar schema, not how much adaptive computation a problem demands. That means a long trace might signal 'this looks unfamiliar' rather than 'this is genuinely hard,' which helps explain why long traces and wrong answers travel together.
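The decoupling claim can be checked by computing the length-difficulty correlation separately for in-distribution and out-of-distribution problems. This is a sketch with invented numbers, not the maze results themselves; `pearson` is a plain textbook implementation written out here so the example has no version-specific dependencies.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical difficulty scores and trace lengths (illustrative only).
difficulty = [1, 2, 3, 4, 5]
in_dist_lengths = [100, 210, 290, 405, 500]   # length tracks difficulty
ood_lengths = [400, 150, 420, 160, 410]       # length unrelated to difficulty

r_in = pearson(difficulty, in_dist_lengths)
r_ood = pearson(difficulty, ood_lengths)
print(f"in-distribution r = {r_in:.2f}, out-of-distribution r = {r_ood:.2f}")
```

A strong correlation on the first split and a near-zero one on the second is the pattern the paragraph above describes.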
Why do the long ones go wrong? Two failure modes reinforce each other: 'wandering' into invalid exploration and 'underthinking' by abandoning promising paths too early. Both are structural disorganization, not insufficient compute (see: Why do reasoning models abandon promising solution paths?). Decoding-level nudges that penalize premature thought-switching improve accuracy without any fine-tuning, which says the model often *had* a viable path and talked itself out of it. The extra length is the wreckage of that wandering, not productive deliberation.
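As a rough illustration of what a decoding-level nudge against thought-switching could look like, the sketch below lowers the scores of switch-marker tokens while the current line of thought is still young. Everything here is an assumption for illustration: the marker tokens, the `penalty` and `window` values, and the function name are hypothetical, and the interventions in the cited note may differ in detail.

```python
def penalize_switch_tokens(logits, switch_tokens, steps_in_thought,
                           penalty=3.0, window=50):
    """Return a copy of `logits` with switch markers penalized early in a thought.

    logits: {token: score} for the next decoding step.
    switch_tokens: tokens assumed to signal a change of approach.
    steps_in_thought: tokens generated since the current thought began.
    """
    if steps_in_thought >= window:
        return dict(logits)  # thought has had room to develop; leave scores alone
    return {
        tok: (score - penalty if tok in switch_tokens else score)
        for tok, score in logits.items()
    }

# Hypothetical next-token scores.
logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch = {"Alternatively", "Wait"}

early = penalize_switch_tokens(logits, switch, steps_in_thought=10)
late = penalize_switch_tokens(logits, switch, steps_in_thought=80)
print(max(early, key=early.get), max(late, key=late.get))
```

Early in a thought the greedy choice moves away from the switch marker; once the window has passed, the original ranking is restored, so the model can still change course after giving a path a fair try.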
The deepest caveat is that the trace may not be doing the reasoning at all. Intermediate tokens carry no special execution semantics; they're generated like any other output, and structurally invalid or even deliberately corrupted traces routinely yield correct answers (see: Do reasoning traces actually cause correct answers?; Do reasoning traces need to be semantically correct?). If traces function more as computational scaffolding than as verified logic, then 'shorter = correct' is less a window into better thinking and more a statistical fingerprint of when the model is operating near familiar, well-rehearsed territory. The practical upshot: rather than scoring trace length, step-level confidence filtering catches breakdowns mid-trace and stops early (see: Does step-level confidence outperform global averaging for trace filtering?). Quality of the trace, not its quantity, is what's worth measuring.
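The contrast between step-level and global confidence can be sketched in a few lines. This assumes each reasoning step already has a confidence score in [0, 1] (for example, derived from token probabilities); the scores, the threshold, and the window size are hypothetical, and `first_breakdown` is a name invented for this example.

```python
def first_breakdown(step_confidences, threshold=0.5, window=2):
    """Return the index of the first step at which the mean confidence over the
    trailing `window` steps drops below `threshold`, or None if it never does."""
    for i in range(window - 1, len(step_confidences)):
        recent = step_confidences[i - window + 1 : i + 1]
        if sum(recent) / window < threshold:
            return i
    return None

# Hypothetical per-step confidences: a dip in the middle, then recovery.
trace = [0.9, 0.95, 0.9, 0.3, 0.35, 0.9, 0.95, 0.9]

global_mean = sum(trace) / len(trace)   # a single average hides the dip
stop_at = first_breakdown(trace)        # the local window flags it mid-trace
print(f"global mean = {global_mean:.2f}, stop at step {stop_at}")
```

The global average stays comfortably above the threshold, while the local check flags the dip and would allow generation to stop before the remaining steps are produced.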
Sources (7 notes)
Across QwQ, DeepSeek-R1, and LIMO, correct solutions average fewer tokens than incorrect ones. Longer traces correlate with more self-revisions, which introduce and compound errors rather than improve reasoning quality.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that a valid trace is not causally necessary for a correct answer; traces correlate with answers via learned formatting, not functional reasoning.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.