Why does fine-tuning sometimes damage chain-of-thought reasoning even when accuracy improves?
This explores a specific paradox: fine-tuning can lift benchmark accuracy while quietly hollowing out the reasoning that's supposed to produce those answers — so the model gets the right answer for the wrong reasons.
The corpus has a sharp, almost clinical answer: fine-tuning can teach a model to produce better *answers* without teaching it to *reason* its way there, and standard metrics, which only check the final answer, are blind to the difference. The clearest evidence is the SFT accuracy trap (see 'Does supervised fine-tuning improve reasoning or just answers?'), where supervised fine-tuning lifts benchmark accuracy but cuts the information gained from each reasoning step by 38.9%. The model learns to write correct-looking conclusions through post-hoc rationalization: the steps decorate the answer rather than derive it.
A companion finding shows this isn't a measurement artifact but a real causal disconnection. After fine-tuning, reasoning chains stop *driving* the output (see 'Does fine-tuning disconnect reasoning steps from final answers?'): you can truncate the chain early, paraphrase it, or swap in filler text, and the model produces the same answer more often than before. The reasoning has become performative: present on the page, but no longer load-bearing. So 'accuracy up, reasoning down' isn't a contradiction; it's what happens when training rewards the destination and ignores the route.
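The truncation and filler tests described above can be sketched as a small harness. This is a minimal illustration, not the method used in the cited note: the `AnswerFn` interface, the helper names, and the two toy models are all hypothetical stand-ins for a real model call. Paraphrasing is left out because it would need a separate paraphrasing model, but it would plug in as another `perturb` function.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning steps) -> final answer string.
AnswerFn = Callable[[str, List[str]], str]
Perturbation = Callable[[List[str]], List[str]]


def truncate(steps: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the chain."""
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]


def filler(steps: List[str], token: str = "...") -> List[str]:
    """Filler substitution: replace every step with content-free text."""
    return [token for _ in steps]


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, List[str]]],
    perturb: Perturbation,
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain.

    A higher rate means the chain has less influence on the answer.
    """
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        unchanged += int(original == perturbed)
    return unchanged / len(examples)


# Toy stand-ins (assumptions for illustration only).
EXAMPLES = [
    ("2+3*4", ["3*4 = 12", "2+12 = 14"]),
    ("10-4/2", ["4/2 = 2", "10-2 = 8"]),
]
MEMORIZED = {"2+3*4": "14", "10-4/2": "8"}


def faithful_model(question: str, steps: List[str]) -> str:
    # Reads the answer off the last reasoning step, so the chain is load-bearing.
    return steps[-1].split("=")[-1].strip()


def performative_model(question: str, steps: List[str]) -> str:
    # Ignores the chain entirely and recalls the answer from the question.
    return MEMORIZED[question]


faithful_rate = invariance_rate(faithful_model, EXAMPLES, truncate)
performative_rate = invariance_rate(performative_model, EXAMPLES, filler)
print(faithful_rate, performative_rate)  # 0.0 1.0
```

Comparing these rates before and after fine-tuning on the same examples is what would expose a chain that has stopped driving the output: accuracy alone would score both toy models identically.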
Why does this happen so readily? Because chain-of-thought may have been fragile to begin with. Several notes argue CoT is closer to constrained imitation than genuine inference (see 'Why does chain-of-thought reasoning fail in predictable ways?' and 'What makes chain-of-thought reasoning actually work?'): models reproduce the *form* of reasoning by pattern-matching, which is exactly the kind of surface structure that fine-tuning is good at sharpening. When you optimize a pattern-matcher against a final-answer reward, it learns the shortest path to looking right. This also explains the distribution-bounded failures: CoT that works on training-like problems collapses under shifts in task, length, or format (see 'Does chain-of-thought reasoning actually generalize beyond training data?'), precisely because the fitted form doesn't carry valid logic underneath it.
There's a deeper mechanical hint in how memorization creeps into reasoning. Token-level analysis finds that 'local' memorization, meaning the model predicts the next step from the immediately preceding tokens rather than from the problem, accounts for up to 67% of reasoning errors, and it worsens under distributional shift (see 'Where do memorization errors arise in chain-of-thought reasoning?'). Fine-tuning that drills on answer patterns can amplify exactly this local-pattern reflex: the model leans harder on 'what usually comes next' and less on 'what this problem actually requires.'
The corpus also points to where the field is looking for repair. Some of the most effective interventions deliberately *avoid* weight updates: pruning low-attention verification steps at test time (see 'Can reasoning steps be dynamically pruned without losing accuracy?'), or applying decoding-level penalties that stop models from wandering and prematurely abandoning good paths (see 'Why do reasoning models abandon promising solution paths?'). That these work without fine-tuning is itself the lesson: the capability is often already present, and the risk of fine-tuning is that in chasing the benchmark it overwrites the reasoning machinery it was meant to strengthen. The quiet takeaway, worth sitting with, is that a higher score can be a symptom of damage, not proof of learning, and you only notice if you measure the steps, not just the answer.
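To make the decoding-level idea concrete, here is a rough sketch of a thought-switching penalty applied to next-token logits, with no weight updates involved. It is an assumption about how such a penalty might look, not the formulation from the cited note: the function names, the `penalty` and `min_thought_length` parameters, and the notion of a fixed set of switch-token ids (tokens such as "Alternatively") are all hypothetical.

```python
from typing import List, Set


def penalize_switch_tokens(
    logits: List[float],
    switch_token_ids: Set[int],
    tokens_in_current_thought: int,
    penalty: float = 3.0,
    min_thought_length: int = 20,
) -> List[float]:
    """Lower the logits of thought-switching tokens while a thought is still short.

    Once the current thought has run for at least `min_thought_length` tokens,
    the logits are returned unchanged, so switching is discouraged early
    rather than forbidden.
    """
    if tokens_in_current_thought >= min_thought_length:
        return list(logits)
    return [
        value - penalty if index in switch_token_ids else value
        for index, value in enumerate(logits)
    ]


def greedy_pick(logits: List[float]) -> int:
    """Return the index of the highest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens; id 1 is assumed to be a switch token.
raw_logits = [1.0, 2.5, 2.0]
switch_ids = {1}

early = greedy_pick(penalize_switch_tokens(raw_logits, switch_ids, 5))
late = greedy_pick(penalize_switch_tokens(raw_logits, switch_ids, 30))
print(early, late)  # 2 1
```

Early in a thought the switch token loses to a continuation token; later the penalty lapses and switching is allowed again. The model's weights, and whatever reasoning ability they hold, are never touched.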
Sources (8 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting that models often reach viable solution paths but abandon them prematurely.