Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?
This explores why supervised fine-tuning can boost a model's final-answer scores while making the reasoning that leads to those answers worse — and what 'worse reasoning' even means if accuracy goes up.
The short version from the corpus: SFT teaches models to land on the right answer, and the fastest route to a right answer is rarely genuine step-by-step inference. When you optimize for the destination, the journey gets hollowed out. Two notes measure this directly: fine-tuning raises benchmark accuracy but cuts 'Information Gain' (how much each reasoning step actually narrows toward the answer) by about 38.9%, meaning the model is increasingly rationalizing a conclusion it reached by pattern-matching rather than inferring it [Does supervised fine-tuning improve reasoning or just answers?; Does supervised fine-tuning actually improve reasoning quality?]. Standard benchmarks can't see this because they only grade the final answer.
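One way to make 'Information Gain' concrete, as a minimal sketch: assume we can read off the probability the model assigns to the correct answer after each reasoning step. The notes summarized here do not spell out the exact formula, so the log-probability difference below is an assumption, and the two trajectories are invented illustrative numbers, not measured data.

```python
import math

def step_information_gain(p_correct_after_step):
    """Per-step gain in bits: log2(p_i) - log2(p_{i-1}).

    p_correct_after_step[0] is the probability of the correct answer
    before any reasoning; entry i is the probability after step i.
    (Assumed definition, for illustration only.)
    """
    gains = []
    for prev, curr in zip(p_correct_after_step, p_correct_after_step[1:]):
        gains.append(math.log2(curr) - math.log2(prev))
    return gains

def mean_information_gain(p_correct_after_step):
    """Average gain per reasoning step."""
    gains = step_information_gain(p_correct_after_step)
    return sum(gains) / len(gains)

# Hypothetical trajectories that end at the same final confidence.
inferring = [0.25, 0.40, 0.60, 0.90]      # each step narrows toward the answer
rationalizing = [0.85, 0.86, 0.88, 0.90]  # answer effectively decided up front

print(mean_information_gain(inferring))
print(mean_information_gain(rationalizing))
```

Both trajectories would score identically on a benchmark that grades only the final answer, yet the second chain contributes almost nothing per step, which is the pattern the notes describe.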
A sharper way to put it: the reasoning becomes decorative. One set of faithfulness tests shows that after fine-tuning, you can chop a reasoning chain off early, paraphrase it, or stuff it with filler, and the model's answer stays the same more often than before [Does fine-tuning disconnect reasoning steps from final answers?]. If the steps don't change the answer, they aren't doing the work; they're a performance staged after the fact. This connects to a stranger finding: models trained on deliberately corrupted, irrelevant reasoning traces do roughly as well as models trained on correct ones, which suggests the chain-of-thought often functions as computational scaffolding (a place to spend compute) rather than meaningful inference [Do reasoning traces need to be semantically correct?].
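The faithfulness tests can be sketched as an invariance check: perturb the reasoning chain and count how often the answer survives. This is a toy harness under stated assumptions, not the notes' actual protocol. `answer_fn` is a hypothetical stand-in for querying a model conditioned on a given chain, the two toy models are invented to show the extremes, and paraphrasing is omitted because it would need a separate paraphraser.

```python
def truncate(steps, keep_fraction=0.5):
    """Early termination: keep only the first part of the chain."""
    keep = max(1, int(len(steps) * keep_fraction))
    return steps[:keep]

def add_filler(steps, filler="..."):
    """Filler substitution: replace every step with content-free text."""
    return [filler for _ in steps]

def invariance_rate(answer_fn, examples, perturb):
    """Fraction of examples whose answer is unchanged by the perturbation.

    answer_fn(question, steps) -> answer is a hypothetical interface.
    A higher rate means the chain has less influence on the answer.
    """
    unchanged = 0
    for question, steps in examples:
        original = answer_fn(question, steps)
        perturbed = answer_fn(question, perturb(steps))
        unchanged += original == perturbed
    return unchanged / len(examples)

def faithful_model(question, steps):
    # Toy model: the answer is computed from the chain (sum of its integers).
    return sum(int(tok) for s in steps for tok in s.split() if tok.isdigit())

def decorative_model(question, steps):
    # Toy model: the answer comes from the question alone; the chain is ignored.
    return len(question)

examples = [
    ("add 2 and 3", ["take 2", "add 3"]),
    ("add 4 and 5 and 6", ["take 4", "add 5", "add 6"]),
]

for name, model in [("faithful", faithful_model), ("decorative", decorative_model)]:
    print(name, invariance_rate(model, examples, truncate),
          invariance_rate(model, examples, add_filler))
```

In this framing, the reported effect of fine-tuning is a shift toward the decorative end: invariance rates rise under all three perturbations.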
Why does SFT in particular cause this? Because it imitates tokens. It rewards reproducing the surface form of an answer, not the principle that generates it. A study of argument-quality judgment makes the mechanism concrete: fine-tuning on labeled examples teaches surface patterns that fail to transfer to new argument types, whereas giving the model an explicit framework to reason with generalizes [Can models learn argument quality from labeled examples alone?]. The same shallowness shows up even in RL fine-tuning: out-of-distribution 'N-1' tests reveal GRPO-trained models sharpening memorized templates rather than installing a real procedure [Do fine-tuned language models actually learn optimization procedures?]. So 'higher accuracy, worse reasoning' isn't a contradiction; it's what template-matching looks like on a test that rewards templates.
The interesting turn the corpus takes is on the fix. If the problem is that SFT optimizes the wrong target, the answer is to reward reasoning quality, not just correctness. RLAG (RL from augmented generation) rewards both answer accuracy and explanation rationality, internalizing coherent knowledge structures in a way plain SFT doesn't [Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?]. RLSF, which uses the model's own answer-span confidence as a reward signal, strengthens step-by-step reasoning while reversing the calibration degradation that RLHF causes [Can model confidence work as a reward signal for reasoning?]. And RLVR concentrates its updates on the ~20% of tokens that are high-entropy 'forking' points, where reasoning decisions actually happen, which is the opposite of SFT's uniform token-imitation [Do high-entropy tokens drive reasoning model improvements?].
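To illustrate what 'high-entropy forking tokens' means operationally, here is a minimal sketch that computes the entropy of each position's next-token distribution and flags the top 20%. The function names, the `top_fraction` parameter, and the example distributions are assumptions for illustration; the notes do not specify how the cited work selects tokens.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_token_mask(distributions, top_fraction=0.2):
    """Flag the top_fraction highest-entropy positions as 'forking' tokens."""
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(len(entropies) * top_fraction))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    chosen = set(ranked[:k])
    return [i in chosen for i in range(len(entropies))]

# Hypothetical next-token distributions for a five-token sequence.
distributions = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],  # genuine decision point
    [0.90, 0.05, 0.03, 0.02],
    [1.00, 0.00, 0.00, 0.00],
    [0.80, 0.10, 0.05, 0.05],
]

print(forking_token_mask(distributions))
```

A mask like this could, for example, gate a per-token training loss so that only the flagged positions contribute gradient, in contrast to SFT's uniform weighting of every token. That use is an illustrative reading of the note, not a description of a specific implementation.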
The thing you didn't know you wanted to know: a deeper line of work argues none of this training is creating reasoning in the first place. Base models already contain latent reasoning ability, and post-training mostly selects when to deploy it rather than building it; hybrid models recover 91% of the gains by routing tokens alone [Do base models already contain hidden reasoning ability?; Does RL post-training create reasoning or just deploy it?]. Seen through that lens, SFT degrades reasoning because it isn't teaching reasoning at all: it's overwriting a capability the base model already had with a shortcut to the answer. (Relatedly, optimal reasoning length follows an inverted-U: more isn't better, and stronger models naturally prefer shorter chains, so longer fine-tuned rationalizations can be a symptom rather than a sign of depth [Why does chain of thought accuracy eventually decline with length?].)
Sources (12 notes)
Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.