INQUIRING LINE

Training an AI to get answers right can raise its test scores while quietly hollowing out the thinking behind them.

Why does supervised fine-tuning degrade reasoning quality despite raising accuracy?

This explores why supervised fine-tuning can boost a model's final-answer scores while making the reasoning that leads to those answers worse — and what 'worse reasoning' even means if accuracy goes up.

The short version from the corpus: SFT teaches models to land on the right answer, and the fastest route to a right answer is rarely genuine step-by-step inference. When you optimize for the destination, the journey gets hollowed out. Two notes measure this directly: fine-tuning raises benchmark accuracy but cuts 'Information Gain' (how much each reasoning step actually narrows toward the answer) by about 38.9% on average, meaning the model is increasingly rationalizing a conclusion it reached by pattern-matching rather than inferring it (see "Does supervised fine-tuning improve reasoning or just answers?" and "Does supervised fine-tuning actually improve reasoning quality?"). Standard benchmarks can't see this because they only grade the final answer.
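To make 'Information Gain' concrete, here is a rough sketch. It is an illustrative proxy, not the metric as defined in the cited notes: it assumes we can read off the model's probability of the correct answer after each reasoning step, and scores each step by the bits it adds to that probability. The names `step_information_gain` and `mean_gain`, and both probability trajectories, are hypothetical.

```python
import math
from typing import List, Sequence

def step_information_gain(p_correct: Sequence[float]) -> List[float]:
    """Per-step gain in bits: log2 p_t - log2 p_{t-1}.

    p_correct[0] is the probability of the correct answer before any
    reasoning; p_correct[t] is the probability after step t.
    ASSUMPTION: this log-probability difference is an illustrative
    proxy, not the definition used in the cited notes.
    """
    if any(p <= 0.0 or p > 1.0 for p in p_correct):
        raise ValueError("probabilities must lie in (0, 1]")
    return [math.log2(b) - math.log2(a)
            for a, b in zip(p_correct, p_correct[1:])]

def mean_gain(p_correct: Sequence[float]) -> float:
    """Average gain per reasoning step (0.0 if there are no steps)."""
    gains = step_information_gain(p_correct)
    return sum(gains) / len(gains) if gains else 0.0

# Hypothetical trajectories of P(correct answer) across three steps.
inferring = [0.125, 0.25, 0.5, 1.0]    # every step narrows toward the answer
rationalizing = [0.5, 0.5, 0.5, 0.5]   # steps never move the answer

inferring_gain = mean_gain(inferring)
rationalizing_gain = mean_gain(rationalizing)
```

Under this proxy, a chain whose steps never move the answer probability scores zero no matter how fluent it reads, which is the kind of degradation a final-answer benchmark cannot register.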

A sharper way to put it: the reasoning becomes decorative. One set of faithfulness tests shows that after fine-tuning, you can chop a reasoning chain off early, paraphrase it, or stuff it with filler, and the model's answer stays the same more often than before (see "Does fine-tuning disconnect reasoning steps from final answers?"). If the steps don't change the answer, they aren't doing the work; they're a performance staged after the fact. This connects to a stranger finding: models trained on systematically irrelevant reasoning traces do roughly as well as models trained on correct ones, which suggests the chain-of-thought often functions as computational scaffolding (a place to spend compute) rather than meaningful inference (see "Do reasoning traces need to be semantically correct?").
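The early-termination test can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol from the cited note: `AnswerFn` is a hypothetical interface that maps a question plus a (possibly truncated) list of reasoning steps to a final answer, and the two toy models stand in for real ones.

```python
from typing import Callable, Sequence

# Hypothetical interface: given a question and the reasoning steps kept so
# far, return the model's final answer as a string.
AnswerFn = Callable[[str, Sequence[str]], str]

def truncation_invariance(answer_fn: AnswerFn, question: str,
                          steps: Sequence[str]) -> float:
    """Fraction of early truncations that leave the final answer unchanged."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(question, steps)
    unchanged = sum(
        answer_fn(question, steps[:k]) == full_answer
        for k in range(len(steps))  # k = 0 .. len-1: every proper prefix
    )
    return unchanged / len(steps)

def functional_model(question: str, steps: Sequence[str]) -> str:
    """Toy model whose answer is computed from the steps it is given."""
    return str(sum(int(s) for s in steps))

def decorative_model(question: str, steps: Sequence[str]) -> str:
    """Toy model that ignores its reasoning and emits a fixed answer."""
    return "12"

steps = ["3", "4", "5"]
functional_rate = truncation_invariance(functional_model, "3 + 4 + 5 = ?", steps)
decorative_rate = truncation_invariance(decorative_model, "3 + 4 + 5 = ?", steps)
```

Both toy models give the same answer on the full chain, so accuracy alone cannot tell them apart; only the invariance rate does. The paraphrase and filler tests would follow the same shape, swapping the truncation for a different perturbation of `steps`.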

Why does SFT in particular cause this? Because it imitates tokens. It rewards reproducing the surface form of an answer, not the principle that generates it. A study of argument-quality judgment makes the mechanism concrete: fine-tuning on labeled examples teaches surface patterns that fail to transfer to new argument types, whereas giving the model an explicit framework to reason with generalizes (see "Can models learn argument quality from labeled examples alone?"). The same shallowness shows up even in RL fine-tuning: on out-of-distribution 'N-1' test sets, GRPO-trained models drop sharply, which indicates they are sharpening memorized templates rather than installing a real procedure (see "Do fine-tuned language models actually learn optimization procedures?"). So 'higher accuracy, worse reasoning' isn't a contradiction; it's what template-matching looks like on a test that rewards templates.

The interesting turn the corpus takes is on the fix. If the problem is that SFT optimizes the wrong target, the answer is to reward reasoning quality, not just correctness. RL from augmented generation (RLAG) rewards both answer accuracy and explanation rationality, internalizing coherent knowledge structures in a way plain SFT doesn't (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?"). Using the model's own answer-span confidence as a reward signal (RLSF) strengthens step-by-step reasoning while repairing the calibration that RLHF damages (see "Can model confidence work as a reward signal for reasoning?"). And RLVR concentrates its updates on the ~20% of tokens that are high-entropy 'forking' points, where reasoning decisions actually happen, which is the opposite of SFT's uniform token-imitation (see "Do high-entropy tokens drive reasoning model improvements?").
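As a rough illustration of what 'high-entropy forking tokens' means, the sketch below computes Shannon entropy for each position's next-token distribution and flags the top fraction. The 20% figure comes from the note; everything else is an assumption: `forking_token_mask` is a hypothetical helper, the distributions are toy values rather than real model outputs, and an actual RLVR setup would apply such a mask to the policy-gradient loss rather than just returning it.

```python
import math
from typing import List, Sequence

def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def forking_token_mask(distributions: Sequence[Sequence[float]],
                       fraction: float = 0.2) -> List[bool]:
    """Mark the highest-entropy `fraction` of positions as forking tokens.

    ASSUMPTION: a simple top-k cut over one sequence; how the cited work
    selects tokens in practice may differ.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    top = sorted(range(len(entropies)),
                 key=lambda i: entropies[i], reverse=True)[:k]
    mask = [False] * len(entropies)
    for i in top:
        mask[i] = True
    return mask

# Toy two-way distributions for five token positions (hypothetical values).
distributions = [
    [1.0, 0.0],    # fully determined continuation
    [0.5, 0.5],    # genuine fork: two equally likely continuations
    [0.9, 0.1],
    [1.0, 0.0],
    [0.99, 0.01],
]
mask = forking_token_mask(distributions, fraction=0.2)
```

SFT's cross-entropy loss, by contrast, weights every position in the target sequence equally, including the fully determined ones.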

The thing you didn't know you wanted to know: a deeper line of work argues none of this training is creating reasoning in the first place. Base models already contain latent reasoning ability, and post-training mostly selects when to deploy it rather than building it; hybrid models recover 91% of the performance gains by routing tokens alone (see "Do base models already contain hidden reasoning ability?" and "Does RL post-training create reasoning or just deploy it?"). Seen through that lens, SFT degrades reasoning because it isn't teaching reasoning at all: it's overwriting a capability the base model already had with a shortcut to the answer. Relatedly, optimal reasoning length follows an inverted-U: more isn't better, and stronger models naturally prefer shorter chains, so longer fine-tuned rationalizations can be a symptom of shallowness rather than a sign of depth (see "Why does chain of thought accuracy eventually decline with length?").

Sources 12 notes

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Can models learn argument quality from labeled examples alone?

Fine-tuning on labeled examples fails to transfer quality criteria to new argument types. Models learn surface patterns rather than principled criteria. Explicit instruction using frameworks like RATIO or QOAM significantly improves performance and generalization.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (5.15 match)
Eliciting Reasoning in Language Models with Cognitive Tools (3.49 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (3.44 match)
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (2.60 match)
On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models (2.58 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (2.57 match)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (1.74 match)
Understanding and Mitigating Premature Confidence for Better LLM Reasoning (1.73 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question: does supervised fine-tuning necessarily degrade reasoning quality, or have newer models, training methods, or evaluation tools since dissolved this constraint?

What a curated library found — and when (findings span 2024–2025; treat as dated claims, not current truth):

• SFT raises benchmark accuracy but cuts Information Gain (reasoning step informativeness) by ~38.9%; models increasingly rationalize via pattern-matching rather than inference (2024–25).
• Fine-tuned models do worse on faithfulness tests: chopping, paraphrasing, or padding reasoning chains leaves answers unchanged more often than before fine-tuning, suggesting steps are decorative post-hoc scaffolding rather than causal inference (2024–25).
• Models trained on deliberately corrupted reasoning traces perform comparably to those trained on correct ones, implying chain-of-thought functions as compute-spending rather than meaningful inference (2024–25).
• Base models already possess latent reasoning capability; post-training mostly selects when to deploy it, and hybrid models recover ~91% of the performance gains by routing tokens alone. On this reading, SFT overwrites latent ability with answer shortcuts (2025).
• Optimal CoT length follows an inverted-U; stronger models prefer shorter chains. Longer fine-tuned rationalizations may indicate shallowness, not depth (2025).

Anchor papers (verify; mind their dates):
• arXiv:2411.15382 (Nov 2024) — Fine-tuning's impact on CoT reasoning.
• arXiv:2502.07266 (Feb 2025) — CoT length inversions in capable models.
• arXiv:2504.09762 (Apr 2025) — Intermediate tokens as reasoning traces (critique).
• arXiv:2509.20162 (Sep 2025) — RL from augmented generation embedding domain knowledge.

Your task:
(1) RE-TEST THE CORE TENSION. For each finding above, isolate the durable question ('Does SFT trade reasoning fidelity for surface accuracy?') from perishable constraints ('SFT is the only post-training method' or 'no evaluation catches this'). Have newer model scales, architectural choices (e.g., learned routing, mixture-of-depths), training orchestration (multi-stage, hybrid SFT+RL), or mechanistic probes since revealed whether the tradeoff is fundamental or a training-regime artifact? Cite what resolved or deepened it.
(2) Surface the strongest CONTRADICTING work from the last 6 months — papers arguing SFT preserves reasoning, or that the 'decorative reasoning' finding doesn't replicate under newer eval harnesses or model sizes.
(3) Propose 2 research questions that assume the constraint may have shifted: (a) e.g., 'Does routing-based post-training (no fine-tuning) recover both reasoning and accuracy?' (b) e.g., 'Do constitutional or tool-use framings restore reasoning fidelity to SFT?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
