Why do invalid reasoning prompts work as well as valid ones?
This explores why chains of reasoning that are logically broken or even nonsensical still produce correct answers — and what that tells us about whether the reasoning is doing the work, or just the look of reasoning is.
The blunt finding the corpus keeps landing on: the *form* of reasoning matters far more than its *validity*. When researchers fed models chain-of-thought examples that were logically invalid, performance held up nearly as well as with valid ones (see "Does logical validity actually drive chain-of-thought gains?"). The model learns the shape of step-by-step reasoning (the rhythm, the connective tissue, the gesture of working through), not genuine inference. For the purpose of reaching the answer, validity turns out to be largely decorative.
The same pattern shows up when you go further and deliberately corrupt the traces. Models trained on systematically irrelevant or scrambled reasoning steps keep their accuracy, and sometimes generalize *better* to out-of-distribution problems (see "Do reasoning traces need to be semantically correct?"). That points to a striking reframe: the reasoning trace functions as computational scaffolding, extra tokens that give the model more room to compute, rather than as a meaningful logical argument. The content of the steps is less important than the fact that there are steps at all.
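To make "scrambled reasoning steps" concrete, here is a minimal sketch of how a corrupted training example could be built: the question and final answer are kept, and only the order of the intermediate steps is shuffled. The `TracedExample` type and `scramble_steps` helper are hypothetical names invented for this illustration; the cited work may corrupt traces in a different way.

```python
import random
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class TracedExample:
    """Hypothetical container: a question, its reasoning steps, and the answer."""
    question: str
    steps: Tuple[str, ...]
    answer: str


def scramble_steps(example: TracedExample, seed: int) -> TracedExample:
    """Return a copy whose steps are shuffled; question and answer are untouched."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    shuffled = list(example.steps)
    rng.shuffle(shuffled)
    return replace(example, steps=tuple(shuffled))


original = TracedExample(
    question="What is (2 + 3) * 4?",
    steps=(
        "Add 2 and 3 to get 5.",
        "Multiply 5 by 4 to get 20.",
        "So the result is 20.",
    ),
    answer="20",
)
corrupted = scramble_steps(original, seed=0)
```

The corrupted example still has the same number of steps and the same answer, which is the property the scaffolding reading cares about: the token budget survives even when the logical order does not.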
This dovetails with a quieter, almost unsettling claim about what those intermediate tokens really are. The 'thinking' you see a reasoning model produce carries no special execution semantics; it's generated the same way as any other output, and invalid traces routinely yield correct answers (see "Do reasoning traces actually cause correct answers?"). The trace correlates with the answer through learned formatting, not because the model is following its own stated logic. We read the trace as a window into a causal process; it's closer to stylistic mimicry of one.
There's a deeper reason the gap between valid and invalid washes out: producing reasoning and *judging* reasoning are separate skills, and models are trained almost entirely on the first. Frontier models that solve problems near-perfectly score as low as 48% when asked to grade solutions with correct answers but broken steps (see "Can models that reason well also grade reasoning well?"). If a model can't reliably tell valid reasoning from invalid even when looking right at it, it's no surprise that training on invalid examples doesn't hurt: the model was never optimizing for validity in the first place, only for answers.
The interesting frontier is what this *doesn't* mean. 'The steps don't matter' isn't quite right either. When researchers actually verify intermediate states during generation rather than scoring only the final answer, task success can jump from 32% to 87%, because many real failures are process violations that final-answer scoring never sees (see "Where do reasoning agents actually fail during long traces?"). So the resolution is subtle: invalid traces work as well as valid ones for *getting the answer* on benchmark problems where the scaffolding is enough, but on long, compounding tasks where steps must actually hold together, the validity you can ignore at small scale comes back to bite. The form gets you surprisingly far; it just isn't the same thing as reasoning.
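The difference between scoring only the final answer and verifying intermediate states can be shown with a toy arithmetic trace. This is a sketch under simplifying assumptions, not the verification setup used in the cited work: the `Step` type and both checker functions are hypothetical, and a "step" here is just a claimed arithmetic result.

```python
import operator
from dataclasses import dataclass
from typing import List, Optional

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass(frozen=True)
class Step:
    """Hypothetical trace step: the claim `left op right = claimed`."""
    left: int
    op: str
    right: int
    claimed: int


def final_answer_correct(trace: List[Step], expected: int) -> bool:
    """Outcome scoring: look only at the last claimed value."""
    return bool(trace) and trace[-1].claimed == expected


def first_invalid_step(trace: List[Step]) -> Optional[int]:
    """Process verification: index of the first step whose arithmetic is wrong."""
    for index, step in enumerate(trace):
        if OPS[step.op](step.left, step.right) != step.claimed:
            return index
    return None


# Computing (2 + 3) * 4 with two errors that happen to cancel out.
broken_trace = [
    Step(2, "+", 3, 6),   # wrong: 2 + 3 is 5
    Step(6, "*", 4, 20),  # wrong: 6 * 4 is 24
]

outcome_ok = final_answer_correct(broken_trace, expected=20)
violation_at = first_invalid_step(broken_trace)
```

Outcome scoring accepts this trace because the last claimed value matches the expected 20, while the per-step check flags the very first step. On a short problem the lucky answer is all a benchmark sees; on a long task, an unflagged bad step like this is what later steps build on.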
Sources (5 notes)
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, which indicates that a valid trace is not causally necessary for a correct answer: traces correlate with answers via learned formatting, not functional reasoning.
Frontier reasoning models solve problems near-perfectly but score as low as 48% when grading solutions with correct answers but flawed steps. Outcome-focused training rewards answer production, not step-by-step verification, leaving evaluation starved.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.