INQUIRING LINE

Why does fine-tuning degrade reasoning quality even as accuracy improves?

This explores why supervised fine-tuning can lift benchmark accuracy while the underlying reasoning gets worse — the model reaches right answers for shallower reasons.


The corpus is unusually direct on this question: supervised fine-tuning raises final-answer accuracy while cutting a model's reasoning informativeness (its 'Information Gain') by about 38.9 percent (see "Does supervised fine-tuning improve reasoning or just answers?" and "Does supervised fine-tuning actually improve reasoning quality?"). The mechanism is post-hoc rationalization: instead of reaching the answer through genuine inferential steps, the fine-tuned model pattern-matches to a correct answer and then produces reasoning text that decorates it. Standard metrics miss this entirely because they only score whether the final answer is right.

The sharpest evidence that the reasoning becomes ornamental rather than functional comes from faithfulness testing. When you cut a fine-tuned model's reasoning chain short, paraphrase it, or splice in filler tokens, the final answer stays the same far more often than before fine-tuning (see "Does fine-tuning disconnect reasoning steps from final answers?"). In other words, the steps stop causally driving the output, and the chain of thought becomes performative. This is why accuracy and reasoning quality can move in opposite directions: after fine-tuning, the answer no longer rides on the visible reasoning.
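The invariance measurement behind these faithfulness tests can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the evaluation code from the cited notes: `answer_fn`, `truncate`, `fill`, and `invariance_rate` are hypothetical names, the two toy "models" stand in for real LLM calls, and the paraphrasing test is left out because it would need a paraphrase model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: answer_fn(question, reasoning_text) -> final answer.
AnswerFn = Callable[[str, str], str]
Example = Tuple[str, List[str]]  # (question, list of reasoning steps)


def truncate(chain: List[str], keep_fraction: float = 0.5) -> List[str]:
    """Early termination: keep only the first part of the reasoning steps."""
    keep = max(1, int(len(chain) * keep_fraction))
    return chain[:keep]


def fill(chain: List[str], filler: str = "...") -> List[str]:
    """Filler substitution: replace every step with a content-free token."""
    return [filler for _ in chain]


def invariance_rate(
    examples: Sequence[Example],
    answer_fn: AnswerFn,
    perturb: Callable[[List[str]], List[str]],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbing the chain."""
    unchanged = 0
    for question, chain in examples:
        original = answer_fn(question, " ".join(chain))
        perturbed = answer_fn(question, " ".join(perturb(chain)))
        unchanged += original == perturbed
    return unchanged / len(examples)


# Toy stand-ins (assumptions for illustration only).
def faithful_answer(question: str, reasoning: str) -> str:
    """Answer is computed from the reasoning text: sum of the integers in it."""
    return str(sum(int(tok) for tok in reasoning.split() if tok.isdigit()))


def ornamental_answer(question: str, reasoning: str) -> str:
    """Answer ignores the reasoning text entirely and depends only on the question."""
    return str(sum(int(tok) for tok in question.split("+")))


examples = [("2+3+4", ["2", "3", "4"]), ("1+1", ["1", "1"])]

for name, fn in [("faithful", faithful_answer), ("ornamental", ornamental_answer)]:
    print(name, invariance_rate(examples, fn, truncate), invariance_rate(examples, fn, fill))
```

A higher invariance rate means the answer depends less on the visible chain. Comparing the rate before and after fine-tuning, on the same examples and perturbations, is the comparison the paragraph above describes.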

Why would training push a model in this direction? A cluster of notes argues that post-training doesn't create reasoning so much as select and route it. Base models already carry latent reasoning capability that minimal training merely elicits (see "Do base models already contain hidden reasoning ability?"), and RL post-training largely teaches a model *when* to deploy reasoning rather than *how* to reason (see "Does RL post-training create reasoning or just deploy it?"). Fine-tuning toward a narrow target distribution optimizes for the cheapest path to the rewarded answer, and the cheapest path is often a memorized shortcut, not a faithful derivation. The capability isn't destroyed; the training just stops requiring the model to use it.

There's a related forgetting story worth knowing about. Fine-tuning for new behaviors can erode pre-trained reasoning, which is why some methods freeze the main model entirely and delegate the new 'thinking' to a small auxiliary module. SoftCoT keeps the backbone frozen precisely to dodge this catastrophic-forgetting trade-off (see "Can continuous reasoning avoid forgetting in instruction-tuned models?"). The common thread: when you optimize a model's weights against a final-answer signal, you put pressure on the very representations that did the reasoning work.

The quietly surprising takeaway is that better reasoning often needs *less* intervention, not more. Optimal chain-of-thought length follows an inverted U: more capable models prefer shorter chains, and accuracy actually declines past a critical thinking-token threshold (see "Why does chain of thought accuracy eventually decline with length?" and "Does more thinking time always improve reasoning accuracy?"). Several of the most effective reasoning fixes don't touch the weights at all: penalizing premature thought-switching at decode time improves accuracy without retraining (see "Do reasoning models switch between ideas too frequently?" and "Why do reasoning models abandon promising solution paths?"), and a single steering vector can compress reasoning verbosity by about two-thirds while holding accuracy steady (see "Can we steer reasoning toward brevity without retraining?"). If reasoning lives in directions you can steer without fine-tuning, that helps explain why fine-tuning, which reshapes everything at once, can blunt the very thing it's trying to improve.
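To make the decode-time idea concrete, here is a minimal sketch of a thought-switching penalty applied to next-token scores. It is an illustration under assumptions rather than the published method: the function names, the example token set, the `penalty` value, and the `window` parameter (how long after a thought starts the penalty stays active) are all hypothetical, and a plain dictionary stands in for a real model's logits.

```python
from typing import Dict, Set


def penalize_switch_tokens(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    tokens_since_thought_start: int,
    penalty: float = 3.0,   # assumed strength, not a published value
    window: int = 600,      # assumed duration in tokens, not a published value
) -> Dict[str, float]:
    """Lower the scores of thought-transition tokens early in the current thought.

    Outside the window the scores are returned unchanged, so the model can
    still move on once it has spent some effort on the current path.
    """
    if tokens_since_thought_start >= window:
        return dict(logits)
    return {
        token: (score - penalty if token in switch_tokens else score)
        for token, score in logits.items()
    }


def greedy_pick(logits: Dict[str, float]) -> str:
    """Greedy decoding step: return the highest-scoring token."""
    return max(logits, key=logits.get)


# Toy next-token scores (made up for illustration).
step_logits = {"Alternatively": 2.0, "so": 1.5, "therefore": 1.0}
switch_tokens = {"Alternatively"}

early = penalize_switch_tokens(step_logits, switch_tokens, tokens_since_thought_start=40)
late = penalize_switch_tokens(step_logits, switch_tokens, tokens_since_thought_start=900)

print(greedy_pick(step_logits))  # unpenalized choice
print(greedy_pick(early))        # choice while the penalty is active
print(greedy_pick(late))         # choice after the window has passed
```

The point of the design is that nothing about the weights changes: the intervention only reshapes the score distribution at sampling time, which is why it can be applied to an already-trained model.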


Sources (11 notes)

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.
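One common way to obtain a single steering vector from paired examples is a difference of mean activations, sketched below. This is an assumption for illustration: the note above does not say how Activation-Steered Compression computes its vector, and `mean_vector`, `steering_vector`, `apply_steering`, and the toy two-dimensional activations are all hypothetical. In practice the activations would come from a chosen layer of the model for concise and verbose versions of the same reasoning.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(concise_acts: Sequence[Vector], verbose_acts: Sequence[Vector]) -> Vector:
    """Difference of means: points from 'verbose' activations toward 'concise' ones."""
    if len(concise_acts) != len(verbose_acts):
        raise ValueError("expected one concise and one verbose activation per pair")
    mean_concise = mean_vector(concise_acts)
    mean_verbose = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]


def apply_steering(hidden: Vector, direction: Vector, strength: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction at inference time."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Toy activations for two pairs (made up for illustration).
concise = [[1.0, 0.0], [3.0, 0.0]]
verbose = [[0.0, 1.0], [0.0, 3.0]]

direction = steering_vector(concise, verbose)
steered = apply_steering([0.0, 0.0], direction, strength=0.5)
print(direction, steered)
```

Because the vector is added at inference time, the model's weights stay untouched, which matches the training-free property the note describes.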

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-capability analyst. The question remains live: Why does fine-tuning degrade reasoning quality even as accuracy improves? A curated library (spanning Nov 2024–Dec 2025) found the following; the date each finding was published is given in parentheses:

• Supervised fine-tuning raises final-answer accuracy while cutting reasoning informativeness ('Information Gain') by ~38.9% (2024-11).
• Fine-tuned models show lower chain-of-thought faithfulness: when reasoning chains are cut short, paraphrased, or spliced with filler, answers remain stable far more often than pre-fine-tuning, suggesting reasoning steps no longer causally drive output (2024-11).
• Optimal chain-of-thought length follows an inverted-U; more capable models prefer *shorter* chains, and accuracy declines past a critical thinking-token threshold (2025-02).
• Decode-time penalties on premature thought-switching improve accuracy without retraining; single steering vectors compress reasoning verbosity by ~67% while holding accuracy steady (2025-01, 2025-07).
• Fine-tuning can erode pre-trained reasoning; SoftCoT avoids this by freezing the main model and delegating new 'thinking' to a small auxiliary module (2025-02).

Anchor papers (verify; mind their dates): arXiv:2411.15382 (Nov 2024), arXiv:2502.07266 (Feb 2025), arXiv:2502.12134 (Feb 2025), arXiv:2507.04742 (July 2025).

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, ask: has newer model architecture, training method (e.g., constitutional AI, test-time scaling, mid-training), tooling (monitoring reasoning fidelity), or orchestration (memory + multi-agent reasoning) since relaxed or overturned it? Separate the durable tension (why optimization pressure erodes reasoning) from perishable implementation limits (e.g., can we now fine-tune without faithfulness loss?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~3 months (post-September 2025). Does any paper show fine-tuning that *preserves* reasoning quality while boosting accuracy?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does post-training on reasoning traces (not just final answers) decouple the accuracy–faithfulness trade-off? Can auxiliary reasoning modules, scaled larger, absorb the full reasoning burden?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
