Can deliberate corruption of reasoning traces harm out-of-distribution generalization?
This explores whether deliberately feeding a model wrong or irrelevant reasoning steps damages its ability to handle inputs unlike its training data — and the corpus suggests the surprising answer is mostly no, which tells us something unsettling about what reasoning traces are actually doing.
The most direct answer in the collection cuts against intuition: corrupting reasoning traces usually does not hurt out-of-distribution (OOD) generalization, and can even help. Models trained on systematically irrelevant or scrambled traces hold their accuracy and *sometimes generalize better* out of distribution. That points to traces working as computational scaffolding, a fixed token budget over which the model can 'spread out' its computation, rather than as meaningful logical steps the answer depends on (see "Do reasoning traces need to be semantically correct?"). If the words in the trace were load-bearing logic, garbling them should wreck OOD behavior. The fact that it doesn't is the interesting part.
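To make "irrelevant or scrambled traces" concrete, here is a minimal sketch of how such training examples could be built. The three corruption modes, the function names `corrupt_trace` and `build_example`, and the `Answer:` target format are all illustrative assumptions, not the procedure used in the cited note. The point is only that the final answer stays fixed while the trace content is degraded.

```python
import random


def corrupt_trace(steps, mode, rng, irrelevant_pool=None):
    """Return a corrupted copy of a list of reasoning steps (hypothetical helper).

    "shuffle_steps":  permute the order of whole steps.
    "shuffle_tokens": permute whitespace-separated tokens inside each step.
    "irrelevant":     swap every step for one drawn from irrelevant_pool,
                      keeping the number of steps unchanged.
    """
    if mode == "shuffle_steps":
        shuffled = list(steps)
        rng.shuffle(shuffled)
        return shuffled
    if mode == "shuffle_tokens":
        scrambled = []
        for step in steps:
            tokens = step.split()
            rng.shuffle(tokens)
            scrambled.append(" ".join(tokens))
        return scrambled
    if mode == "irrelevant":
        if not irrelevant_pool:
            raise ValueError("irrelevant mode needs a non-empty irrelevant_pool")
        return [rng.choice(irrelevant_pool) for _ in steps]
    raise ValueError(f"unknown corruption mode: {mode!r}")


def build_example(question, steps, answer, mode=None, seed=0, irrelevant_pool=None):
    """Build one training example; the answer is never corrupted, only the trace."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    if mode is None:
        trace = list(steps)
    else:
        trace = corrupt_trace(steps, mode, rng, irrelevant_pool)
    # Assumed target format: one step per line, then the answer line.
    return {"prompt": question, "target": "\n".join(trace + [f"Answer: {answer}"])}
```

Training one model on clean examples and another on a corrupted mode, then comparing both on held-out OOD inputs, is the kind of comparison the paragraph above describes. Because the number of steps is preserved in every mode, the token budget stays roughly comparable across conditions.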
That finding only makes sense once you accept a broader claim running through the corpus: chain-of-thought (CoT) is constrained imitation, not inference. Several notes converge here: traces reproduce the *form* of reasoning by pattern-matching, which is why structurally valid but logically invalid prompts still succeed and why format effects dominate content (see "What makes chain-of-thought reasoning actually work?" and "Why does chain-of-thought reasoning fail in predictable ways?"). One study reports that intermediate tokens carry no special execution semantics at all: invalid traces frequently produce correct answers, so the trace correlates with the answer through learned formatting, not function (see "Do reasoning traces actually cause correct answers?"). There is even mechanistic backing: models trained with hidden CoT tokens can compute the right answer in their early layers, then actively overwrite it to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). If the real work happens elsewhere, corrupting the visible trace leaves the work intact.
But 'corruption doesn't matter' shouldn't be read as 'reasoning is robust OOD.' The opposing note in the collection is just as strong: CoT is distribution-bounded and degrades *predictably* once you shift the task, length, or format, producing fluent but inconsistent reasoning that imitates the form without valid logic underneath (see "Does chain-of-thought reasoning actually generalize beyond training data?"). So what breaks OOD is not whether the trace is 'correct' but whether the input still resembles the training data. A clean demonstration: trace length tracks problem difficulty only inside the training distribution and decouples entirely outside it, because length reflects recall of memorized schemas, not adaptive thinking (see "Does longer reasoning actually mean harder problems?").
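One way to check the length-versus-difficulty claim on your own evaluation logs is to compute the correlation separately for each split. The sketch below assumes hypothetical records with `split`, `difficulty`, and `trace_len` fields; the field names and the use of a Pearson correlation are assumptions for illustration, not the measurement used in the cited note.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Strictly undefined; reported as 0.0 here by assumption,
        # meaning "no linear relationship could be measured".
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_difficulty_correlation(records):
    """Correlate trace length with difficulty separately for each split.

    records: iterable of dicts with keys "split", "difficulty", "trace_len"
    (hypothetical field names).
    """
    by_split = {}
    for record in records:
        difficulties, lengths = by_split.setdefault(record["split"], ([], []))
        difficulties.append(record["difficulty"])
        lengths.append(record["trace_len"])
    return {split: pearson(d, l) for split, (d, l) in by_split.items()}
```

If the claim holds for a given model, the in-distribution split should show a clearly positive correlation while the OOD split sits near zero.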
Where harm *does* show up is at the token level. Local memorization, meaning prediction from the immediately preceding tokens, accounts for up to 67% of reasoning errors, and its share grows as complexity rises and distributional shift sets in (see "Where do memorization errors arise in chain-of-thought reasoning?"). So the harm comes not from semantically wrong content per se but from the model leaning on surface token patterns when it is pushed off-distribution. Relatedly, fine-tuning can quietly weaken the causal link between steps and answers, making reasoning performative: early termination, paraphrasing, and filler substitution more often leave answers unchanged (see "Does fine-tuning disconnect reasoning steps from final answers?").
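The faithfulness tests mentioned above can be sketched as trace perturbations plus an invariance measure. Everything named here is an assumption for illustration: `answer_fn` stands in for whatever call produces a model's final answer from a question and a (possibly perturbed) trace, and the truncation fraction and filler string are arbitrary choices. Paraphrasing is left out because it needs a second model.

```python
def truncate_trace(steps, keep_fraction):
    """Early termination: keep only the first part of the trace."""
    keep = int(len(steps) * keep_fraction)
    return list(steps[:keep])


def filler_trace(steps, filler="..."):
    """Filler substitution: replace every token with a filler string,
    preserving the number of steps and tokens per step."""
    return [" ".join(filler for _ in step.split()) for step in steps]


def invariance_rate(examples, answer_fn, perturb):
    """Fraction of examples whose final answer is unchanged by the perturbation.

    examples:  list of (question, steps) pairs.
    answer_fn: hypothetical callable (question, steps) -> answer.
    perturb:   callable steps -> perturbed steps.
    """
    if not examples:
        raise ValueError("need at least one example")
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        if original == perturbed:
            unchanged += 1
    return unchanged / len(examples)
```

A high invariance rate means the trace has little causal influence on the answer. Comparing the rate before and after fine-tuning, on the same examples and perturbations, is the comparison the paragraph describes.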
The takeaway you might not have gone looking for: if you want to actually *improve* OOD behavior, the lever isn't trace correctness but trace selection and control. Step-level confidence filtering catches reasoning breakdowns that global averaging hides, matching the accuracy gains of naive majority voting with far fewer traces, so quality of selection beats quantity (see "Does step-level confidence outperform global averaging for trace filtering?"). And decoding-time nudges like thought-switching penalties recover accuracy from models that 'wander' and abandon good paths, with no fine-tuning required (see "Why do reasoning models abandon promising solution paths?"). In other words: the trace's words can be corrupted at little cost, but *how the model navigates and trusts those traces* is where OOD generalization is won or lost.
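The difference between global averaging and step-level confidence can be shown with a small sketch. It assumes each sampled trace comes with per-token log-probabilities and scores a trace by its weakest sliding window; the window size, the threshold, and the function names are illustrative assumptions, not the method from the cited note.

```python
from collections import Counter


def global_confidence(token_logprobs):
    """Mean log-probability over the whole trace."""
    if not token_logprobs:
        raise ValueError("trace has no tokens")
    return sum(token_logprobs) / len(token_logprobs)


def lowest_window_confidence(token_logprobs, window):
    """Mean log-probability of the weakest contiguous window of tokens.

    A short local collapse drags this score down even when the
    global mean still looks healthy.
    """
    n = len(token_logprobs)
    if n <= window:
        return global_confidence(token_logprobs)
    return min(
        sum(token_logprobs[i:i + window]) / window
        for i in range(n - window + 1)
    )


def majority_vote(answers):
    """Most common answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def filtered_vote(traces, window, threshold):
    """Vote only over traces whose weakest window clears the threshold.

    traces: list of (answer, token_logprobs) pairs.
    Falls back to voting over all traces if the filter rejects everything.
    """
    kept = [
        answer
        for answer, logprobs in traces
        if lowest_window_confidence(logprobs, window) >= threshold
    ]
    return majority_vote(kept if kept else [answer for answer, _ in traces])
```

A trace that is confident for most of its length but collapses over a few tokens can pass a global-mean filter and still be rejected by the window score, which is the sense in which step-level confidence catches breakdowns that averaging hides.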
Sources (11 notes)
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why logically invalid but structurally coherent prompts succeed, and why stronger reasoning models become less instruction-compliant.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, which suggests traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting that models reach viable solution paths but abandon them prematurely.