INQUIRING LINE

Why does SFT reduce reasoning quality even when improving domain accuracy?


This explores why supervised fine-tuning (SFT) can lift a model's benchmark accuracy in a domain while the actual quality of its reasoning gets worse — and what's going wrong underneath. The short version the corpus keeps circling back to: SFT teaches models to produce the *look* of a correct answer faster than it teaches them to *arrive* at one. Accuracy and reasoning quality are measuring different things, and fine-tuning can pull them in opposite directions.

The most direct evidence comes from work measuring reasoning "informativeness" rather than just final correctness: SFT raises final-answer accuracy but cuts the information gain of the reasoning steps by roughly 39% (see "Does supervised fine-tuning actually improve reasoning quality?" and "Does supervised fine-tuning improve reasoning or just answers?"). The model starts reaching right answers through pattern-matching shortcuts and post-hoc rationalization instead of genuine step-by-step inference. Standard benchmarks can't see this, because they only grade the final answer. A related faithfulness study shows the mechanism concretely: after fine-tuning, you can chop off, paraphrase, or stuff filler into the reasoning chain and the final answer often doesn't change (see "Does fine-tuning disconnect reasoning steps from final answers?"). The reasoning has become decorative, performative rather than functional. The answer was decided elsewhere; the chain is theater.
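The perturbation tests described above can be sketched as a small harness. This is a minimal illustration, not the cited study's protocol: `functional_model`, `decorative_model`, `EXAMPLES`, and `MEMORIZED` are hypothetical stand-ins, and only early termination and filler substitution are shown (paraphrasing would need a real rewriting model).

```python
from typing import Callable, List, Tuple

Steps = List[str]

def truncate(steps: Steps, keep_fraction: float = 0.5) -> Steps:
    """Early termination: keep only the first part of the chain."""
    cut = max(1, int(len(steps) * keep_fraction))
    return steps[:cut]

def filler(steps: Steps, token: str = "...") -> Steps:
    """Filler substitution: replace every step with content-free filler."""
    return [token for _ in steps]

def invariance_rate(
    answer_fn: Callable[[str, Steps], int],
    examples: List[Tuple[str, Steps]],
    perturb: Callable[[Steps], Steps],
) -> float:
    """Fraction of examples whose answer survives perturbing the chain.

    A high rate suggests the chain is not causally driving the answer.
    """
    unchanged = sum(
        answer_fn(q, perturb(steps)) == answer_fn(q, steps)
        for q, steps in examples
    )
    return unchanged / len(examples)

# Hypothetical toy data: each chain lists the numbers to add.
EXAMPLES = [
    ("2+3+4", ["add 2", "add 3", "add 4"]),
    ("1+1", ["add 1", "add 1"]),
]
MEMORIZED = {"2+3+4": 9, "1+1": 2}

def functional_model(question: str, steps: Steps) -> int:
    """Toy model whose answer is computed from the chain itself."""
    return sum(int(tok) for s in steps for tok in s.split() if tok.isdigit())

def decorative_model(question: str, steps: Steps) -> int:
    """Toy model that ignores the chain and recalls a memorized answer."""
    return MEMORIZED[question]

for name, model in [("functional", functional_model),
                    ("decorative", decorative_model)]:
    print(name,
          invariance_rate(model, EXAMPLES, truncate),
          invariance_rate(model, EXAMPLES, filler))
```

In this toy setup the decorative model's answers never change under perturbation, while the functional model's always do. Real models would land somewhere in between, and the claim above is that fine-tuning pushes them toward the decorative end.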

Why would SFT specifically cause this? Because imitation learning rewards surface form. On optimization problems, fine-tuning makes outputs *look* correct (valid JSON, proper sections, right identifiers) without making the solutions physically feasible (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). Formatting is the easy thing to copy; constraint-satisfying reasoning is the hard thing, so the gradient takes the cheap win. Interpretability work offers an even sharper version: transformers trained to hide their reasoning compute the right answer in early layers and then actively *overwrite* it to emit format-compliant filler (see "Do transformers hide reasoning before producing filler tokens?"). SFT toward a target format can literally train a model to suppress visible reasoning in favor of looking right.
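To make the gap between "looks correct" and "is feasible" concrete, here is a small sketch that checks the two separately. The capacity-constrained selection problem, the field names `items` and `total_weight`, and the constants `WEIGHTS`, `CAPACITY`, and `OUTPUT` are all assumptions chosen for illustration; the cited work concerns other optimization problems.

```python
import json

def is_well_formed(raw: str, required_keys=("items", "total_weight")) -> bool:
    """Surface check: parses as JSON and has the expected fields."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and all(k in obj for k in required_keys)

def is_feasible(raw: str, weights: dict, capacity: int) -> bool:
    """Substance check: the chosen items exist, the reported total is
    right, and the capacity constraint actually holds."""
    obj = json.loads(raw)
    try:
        actual = sum(weights[i] for i in obj["items"])
    except KeyError:
        return False  # missing field or unknown item identifier
    return actual == obj["total_weight"] and actual <= capacity

# Hypothetical problem instance and a plausible-looking model output.
WEIGHTS = {"a": 4, "b": 5, "c": 3}
CAPACITY = 8
OUTPUT = '{"items": ["a", "b"], "total_weight": 9}'

print(is_well_formed(OUTPUT))                 # format passes
print(is_feasible(OUTPUT, WEIGHTS, CAPACITY)) # constraint fails (9 > 8)
```

An evaluation that only runs the first check would score this output as a success, which is the failure mode the paragraph above describes.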

The lateral insight is that this is an imitation-learning failure mode, which becomes clearer when contrasted with how reinforcement learning behaves. RLVR (reinforcement learning with verifiable rewards) sharpens the few high-stakes "forking" tokens where reasoning actually branches (see "Do high-entropy tokens drive reasoning model improvements?"), and RL-trained models naturally gravitate toward *shorter* chains as they get more capable, with simplicity emerging from a reward signal rather than copied style (see "Why does chain of thought accuracy eventually decline with length?"). That matters because most tokens in a verbose chain do stylistic work, not computational work: minimal reasoning chains match verbose ones at ~7.6% of the token cost (see "Can minimal reasoning chains match full explanations?"). SFT, by imitating reference traces, copies the style without distinguishing the load-bearing steps from the decoration.
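One way to picture "forking" tokens is to rank positions by the entropy of the model's next-token distribution and keep the top slice. The sketch below assumes per-position probability distributions are already available; `forking_positions` and the sample `DISTS` are hypothetical, and the default `top_fraction=0.2` simply mirrors the ~20% figure reported in the source note.

```python
import math
from typing import List

def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_positions(distributions: List[List[float]],
                      top_fraction: float = 0.2) -> List[int]:
    """Indices of the highest-entropy positions, in sequence order."""
    ents = [entropy(d) for d in distributions]
    k = max(1, round(len(ents) * top_fraction))
    ranked = sorted(range(len(ents)), key=lambda i: ents[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical distributions over a 4-token vocabulary at 5 positions.
# Position 2 is near-uniform: the model is genuinely undecided there.
DISTS = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],
    [0.99, 0.005, 0.003, 0.002],
    [0.80, 0.10, 0.05, 0.05],
]

print(forking_positions(DISTS))
```

A plain imitation loss weights all five positions alike, whereas the finding cited above is that RLVR's updates concentrate on positions like the near-uniform one.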

The thing you didn't know you wanted to know: this is the same brittleness that shows up when chain-of-thought meets anything outside its training distribution. CoT produces fluent but logically inconsistent reasoning the moment task, length, or format shifts (see "Does chain-of-thought reasoning actually generalize beyond training data?"). A model that learned to pattern-match its way to in-domain answers hasn't built reasoning it can transport, so the higher domain accuracy and the degraded reasoning aren't a paradox. They're the same fact seen from two sides: the model got better at this domain's answers precisely by getting worse at reasoning in general.


Sources 9 notes

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether supervised fine-tuning's reasoning-degradation penalty persists or has been overcome. The question: Does SFT still degrade reasoning quality (measured by faithfulness, information gain, or step validity) even when raising domain accuracy—or have newer models, training methods, or evaluation regimes dissolved this trade-off?

What a curated library found — and when (findings span 2024–2025, dated claims not current truth):
• SFT raises benchmark accuracy but cuts reasoning information gain by ~39%; models pattern-match answers instead of deriving them (2024–2025).
• Post-hoc faithfulness: fine-tuning makes reasoning decorative—you can paraphrase or truncate chains without breaking final answers (2024).
• Mechanistic origin: transformers compute correct answers early, then overwrite visible reasoning to meet format targets; SFT amplifies this (2024–12).
• RL-trained models naturally shorten CoT chains as capability grows; SFT copies verbose reference style without isolating load-bearing steps (minimal chains match verbose ones at ~7.6% of the token cost) (2025–02, 2025–06).
• CoT reasoning degrades predictably outside training distribution—high in-domain accuracy masks fragile, non-transportable reasoning (2025–08).

Anchor papers (verify; mind their dates):
• arXiv:2412.04537 (Understanding Hidden Computations in Chain-of-Thought Reasoning, 2024–12)
• arXiv:2506.01939 (High-Entropy Minority Tokens Drive Effective RL, 2025–06)
• arXiv:2508.01191 (Is Chain-of-Thought Reasoning of LLMs a Mirage?, 2025–08)
• arXiv:2510.25992 (Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning, 2025–10)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above (information-gain drop, faithfulness decay, format-driven overwriting, token-cost asymmetry, distribution brittleness), establish whether o1-class models, newer post-training curricula, process reward models, or multi-agent orchestration have since closed or widened these gaps. Separate the durable question (whether SFT decouples accuracy from reasoning) from perishable limitations (specific magnitude, mechanism, or model class). Cite what resolved or preserved each constraint.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months (arXiv only). Does anything claim SFT now reliably preserves reasoning fidelity alongside accuracy, or that the tradeoff was an artifact of older models/evals?
(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) Do process rewards + SFT avoid reasoning degradation better than accuracy-only fine-tuning? (b) Can mixture-of-experts or sparse training routes encode load-bearing steps separately from format compliance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines