Why do deliberately corrupted reasoning traces sometimes generalize better than correct ones?
This explores why training a model on reasoning traces full of wrong or irrelevant steps can match — and occasionally beat — training on correct ones, especially on problems unlike those it was trained on.
The corpus's answer is bracing: the surprise dissolves once you stop believing the trace is reasoning at all. When models trained on systematically irrelevant traces hold their accuracy and sometimes generalize *better* out of distribution (Do reasoning traces need to be semantically correct?), that is evidence the trace works as computational scaffolding, a budget of intermediate tokens that gives the model room to compute, rather than a chain of meaningful steps the model follows. Several notes converge on the same demolition: intermediate tokens carry no special execution semantics and are generated exactly like any other output, so invalid traces frequently yield correct answers (Do reasoning traces actually cause correct answers?), and the performance gains are not coming from semantic correctness (Do reasoning traces show how models actually think?).
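To make "systematically irrelevant traces" concrete, here is a minimal sketch of one way such a training set could be built: each problem keeps its own answer but borrows the trace of a different problem. The notes do not say how the original experiments constructed their traces, so the `Example` type, the `with_irrelevant_traces` helper, and the rotation scheme are all hypothetical illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    problem: str
    trace: str   # intermediate tokens shown to the model during training
    answer: str  # final answer, left untouched by the corruption


def with_irrelevant_traces(examples, shift=1):
    """Pair each problem with another problem's trace, keeping its own answer.

    Rotating by `shift` (an assumed, illustrative scheme) guarantees that no
    example is paired with the trace at its own position.
    """
    n = len(examples)
    if n < 2:
        raise ValueError("need at least two examples to swap traces")
    if shift % n == 0:
        raise ValueError("shift must not be a multiple of the dataset size")
    return [
        Example(ex.problem, examples[(i + shift) % n].trace, ex.answer)
        for i, ex in enumerate(examples)
    ]
```

The corrupted set has the same problems, the same answers, and the same token budget per example; only the relevance of the trace to its problem changes, which is the variable the scaffolding claim is about.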
If correctness isn't the active ingredient, what is? Form. Chain-of-thought turns out to mirror the *shape* of reasoning without the logical abstraction: format matters more than content, and invalid prompts work about as well as valid ones (What makes chain-of-thought reasoning fail in language models?). So a corrupted trace that preserves the right structural rhythm (planning, then steps, then a turn toward an answer) still delivers the scaffolding benefit, because the model learned to ride the format, not the facts inside it.
That reframing also explains the *better* part, not just the comparable part. Correct traces can actively hurt: when a model keeps reasoning after the answer is already settled, that post-conclusion exploration degrades fine-tuning. Removing only that tail helps more than removing an equally long random suffix, so the damage comes from the unnecessary exploration, not from length (Does every correct chain-of-thought trace improve fine-tuning?). A trimmed or 'corrupted' trace that lacks the meandering tail can therefore be a cleaner training signal than a verbose correct one. This dovetails with the finding that models wander and underthink, exploring invalid paths and abandoning good ones, which are failures of structure, not compute (Why do reasoning models abandon promising solution paths?).
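Trimming the post-conclusion tail can be sketched as follows. This is a toy heuristic, not the method from the cited note: it assumes the trace is already split into steps and treats the first step that contains the final answer string as the point where the answer is settled. The function name `trim_post_conclusion` is hypothetical.

```python
def trim_post_conclusion(steps, final_answer):
    """Split a trace into (kept, tail) at the first step stating the answer.

    Assumption: the answer counts as "settled" at the first step whose text
    contains `final_answer`. A real pipeline would need a stronger signal
    than substring matching.
    """
    for i, step in enumerate(steps):
        if final_answer in step:
            return steps[: i + 1], steps[i + 1 :]
    # The answer never appears, so there is no tail to remove.
    return list(steps), []


steps = [
    "Let x = 3 and y = 4.",
    "Add them: x + y.",
    "So the answer is 7.",
    "Wait, maybe check another way.",
    "3 + 4 is still 7.",
]
kept, tail = trim_post_conclusion(steps, "7")
```

Here `kept` holds the first three steps and `tail` the last two. Returning the tail separately makes it straightforward to build a length-matched control that removes the same number of steps from elsewhere in the trace.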
The deeper payoff is knowing where the illusion has limits. Not all of a trace is interchangeable noise: planning and backtracking sentences act as genuine 'thought anchors' that causally steer what follows (Which sentences actually steer a reasoning trace?), so corruption that spares those pivots costs little. And the scaffolding story only holds near the training distribution: push problems out of distribution and CoT degrades predictably, producing fluent but logically inconsistent output (Does chain-of-thought reasoning actually generalize beyond training data?), while trace length stops tracking difficulty and instead reflects how close the problem sits to memorized schemas (Does longer reasoning actually mean harder problems?).
The thing you didn't know you wanted to know: 'corrupted traces generalize better' isn't a glitch; it's a measurement of how little of the visible reasoning was ever load-bearing. The fix that follows is to stop trusting the final answer as proof the process was sound and to verify the intermediate steps themselves. That is where reliability is won: one study lifted task success from 32% to 87% by adding intermediate verification rather than scoring only the output (Where do reasoning agents actually fail during long traces?).
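The gap between checking the output and checking the process can be shown with a toy arithmetic trace. This is only a sketch under stated assumptions: each step is a hypothetical `(a, op, b, claimed)` tuple, and the checks are plain arithmetic, not the intermediate-state and policy-compliance checks used in the cited study.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def check_outcome(steps, expected):
    """Outcome-only check: inspect the last claimed value and nothing else."""
    return bool(steps) and steps[-1][3] == expected


def first_process_violation(steps):
    """Return the index of the first step whose claimed result is wrong, or None."""
    for i, (a, op, b, claimed) in enumerate(steps):
        if OPS[op](a, b) != claimed:
            return i
    return None


# Computing (2 + 3) * 2 = 10 with two errors that cancel out:
# step 0 claims 2 + 3 = 6, step 1 claims 6 * 2 = 10.
steps = [(2, "+", 3, 6), (6, "*", 2, 10)]
outcome_ok = check_outcome(steps, expected=10)
violation = first_process_violation(steps)
```

In this example the outcome check passes while the process check flags step 0, which is the kind of failure a final-answer score cannot see.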
Sources (10 notes)
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.
LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Research shows CoT mirrors reasoning form without true logical abstraction. Format matters more than content, invalid prompts work as well as valid ones, and scaling reasoning creates instruction-following deficits.
Post-conclusion reasoning, where the model keeps exploring after it has sufficient evidence for the answer, degrades supervised fine-tuning despite preserving correctness. Removing only this tail improves learning more than removing equally long random suffixes, indicating the harm comes from unnecessary exploration, not length.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.