Why do invalid reasoning steps produce nearly the same performance gains?
This explores why chain-of-thought reasoning still boosts performance even when the intermediate steps are logically wrong, irrelevant, or corrupted — and what that tells us about what reasoning traces actually *do*.
If the logic doesn't have to be valid, then the gains aren't coming from the model genuinely "thinking through" the problem. The corpus converges on a striking answer: reasoning traces work largely as *form*, not *inference*. When researchers fed models chain-of-thought exemplars that were logically invalid, performance on hard benchmarks such as BIG-Bench Hard stayed close to that of valid exemplars (see: Does logical validity actually drive chain-of-thought gains?). The model learns the *shape* of step-by-step reasoning (the cadence, the connective tissue, the act of producing intermediate tokens), and that shape is what carries the gain, not the truth of any individual step.
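To make the setup concrete, here is a minimal Python sketch of how an invalid-but-well-formed exemplar could be built. Everything in it is an illustrative assumption: `corrupt_steps` and `format_exemplar` are hypothetical helpers, and shuffling the steps is just one simple way to break logical validity while preserving the step-by-step form. The cited study may have corrupted its exemplars differently.

```python
import random


def corrupt_steps(steps, rng):
    """Return the same steps in a different order (hypothetical corruption).

    The sentences, their count, and the numbered format are unchanged;
    only the logical ordering is broken.
    """
    shuffled = steps[:]
    # Reshuffle until the order differs; skipped when all steps are identical.
    while shuffled == steps and len(set(steps)) > 1:
        rng.shuffle(shuffled)
    return shuffled


def format_exemplar(question, steps, answer):
    """Render one few-shot exemplar in a fixed step-by-step layout."""
    lines = [f"Q: {question}", "A: Let's think step by step."]
    lines += [f"{i}. {step}" for i, step in enumerate(steps, start=1)]
    lines.append(f"So the answer is {answer}.")
    return "\n".join(lines)


# Toy example (invented for illustration, not taken from any benchmark).
question = "Tom has 3 apples and buys 2 bags of 4 apples. How many apples?"
valid_steps = [
    "Each bag holds 4 apples.",
    "Two bags hold 2 * 4 = 8 apples.",
    "Adding the original 3 gives 8 + 3 = 11 apples.",
]
rng = random.Random(0)  # fixed seed so the corruption is reproducible
valid_exemplar = format_exemplar(question, valid_steps, "11")
invalid_exemplar = format_exemplar(question, corrupt_steps(valid_steps, rng), "11")
```

The two exemplars have identical length, vocabulary, and layout, which is the point of the comparison: any accuracy difference between prompts built from them would have to come from logical order alone.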
The same pattern shows up when traces are deliberately sabotaged. Models trained on systematically irrelevant or corrupted traces hold their accuracy, and sometimes even *generalize better* out-of-distribution (see: Do reasoning traces need to be semantically correct?). The interpretation there is that traces function as computational scaffolding (extra serial compute and a structured context window) rather than meaningful logical derivation. The steps give the model room to work, regardless of whether the steps say anything correct.
A sharper, almost provocative framing comes from work arguing that reasoning tokens are stylistic mimicry: a model's intermediate "thoughts" are generated by the exact same mechanism as any other output, with no special execution semantics, and invalid traces routinely yield correct answers (see: Do reasoning traces actually cause correct answers?). Traces *correlate* with right answers through learned formatting; they don't *cause* them in the way the narrative suggests. This is why the question's premise holds: validity was never the load-bearing ingredient.
What's quietly fascinating is that the corpus also shows *which* steps the model actually leans on. Attention-map analysis finds that verification and backtracking steps receive almost no downstream attention: you can prune 75% of reasoning steps and keep accuracy (see: Can reasoning steps be dynamically pruned without losing accuracy?). So a lot of what looks like careful reasoning is decorative; the model isn't reading most of it. That dovetails with the finding that real reasoning *activation* during training and *benchmark improvement* are separable phenomena: the model can pick up genuine reasoning behaviors while the score gains may come from something else, such as memorization of contaminated benchmark data (see: Can genuine reasoning activation coexist with contaminated benchmarks?).
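The pruning idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the cited framework's method: `prune_steps` is a hypothetical helper, the per-step attention scores are assumed to have been computed already (for example, by aggregating how much later tokens attend to each step), and the default `keep_fraction=0.25` simply mirrors the "prune 75%" figure above.

```python
def prune_steps(steps, attention_scores, keep_fraction=0.25):
    """Keep the most-attended reasoning steps, preserving their original order.

    steps: list of reasoning-step strings.
    attention_scores: one precomputed score per step (assumed input).
    keep_fraction: share of steps to keep (assumed default, not a tuned value).
    """
    if len(steps) != len(attention_scores):
        raise ValueError("steps and attention_scores must have the same length")
    if not steps:
        return []
    n_keep = max(1, round(len(steps) * keep_fraction))
    # Rank step indices from most to least attended.
    ranked = sorted(range(len(steps)), key=lambda i: attention_scores[i], reverse=True)
    # Restore chronological order among the kept steps.
    kept = sorted(ranked[:n_keep])
    return [steps[i] for i in kept]


# Toy trace with invented scores: the "check" and "wait" steps get little attention.
trace = [
    "Set up the equation 2x + 3 = 11.",
    "Wait, let me re-read the problem.",
    "Check: the numbers are copied correctly.",
    "Subtract 3 and divide by 2, so x = 4.",
]
scores = [0.45, 0.03, 0.02, 0.50]
pruned = prune_steps(trace, scores, keep_fraction=0.5)
```

With these made-up scores, the two low-attention steps are dropped and the setup and solving steps survive. Whether accuracy is preserved on a real model is an empirical question this sketch does not answer.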
The twist worth sitting with: this doesn't mean reasoning is fake everywhere. When you measure failures at the *process* level rather than the final answer, the picture flips: checking intermediate states catches errors that final-answer scoring misses, and one system jumped from 32% to 87% task success by verifying steps as they're generated (see: Where do reasoning agents actually fail during long traces?). So validity *does* matter, just not for the easy benchmark wins that prompted this question. On short, well-trodden problems, the scaffolding alone is enough; the actual logic only starts to pay off when traces get long and the failures are process violations rather than wrong final answers.
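The difference between final-answer scoring and process-level checking can be shown with a small Python sketch. The names `final_answer_ok`, `first_violation`, and `max_change_policy` are hypothetical, and the policy (no single step may change the value by more than 10) is an invented stand-in for whatever rules a real system would enforce.

```python
def final_answer_ok(states, expected):
    """Outcome-level check: only the last state is compared to the target."""
    return states[-1] == expected


def first_violation(states, step_is_valid):
    """Process-level check: return the index of the first invalid transition,
    or None if every consecutive pair of states passes the policy."""
    for i in range(1, len(states)):
        if not step_is_valid(states[i - 1], states[i]):
            return i
    return None


def max_change_policy(prev, curr, limit=10):
    """Invented example policy: a step may change the value by at most `limit`."""
    return abs(curr - prev) <= limit


# Toy trace of intermediate values; the jump from 10 to 35 breaks the policy,
# yet the trace still ends on the expected value.
states = [0, 10, 35, 30]
outcome_passes = final_answer_ok(states, expected=30)
violation_index = first_violation(states, max_change_policy)
```

Here the outcome check passes while the process check flags the third state, which is the kind of failure that scoring only the final answer would never surface.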
Sources (6 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.