INQUIRING LINE


Why can fine-tuning make an AI more helpful on the surface while silently degrading what it actually knows?

Why does fine-tuning improve some capabilities while degrading others?

This explores the tradeoff inside fine-tuning — why a model can get better at how it answers (formatting, helpfulness, accuracy scores) while getting worse at the underlying competence (reasoning, factuality, generalization) — and what the corpus says about the mechanism behind that split.

Fine-tuning seems to give with one hand and take with the other. The most useful frame in the corpus is a layered one: fine-tuning mostly edits *behavior*, not *knowledge*. One study that emulated fine-tuning at different scales found a clean decoupling: scaling pretraining improves factual knowledge, while scaling fine-tuning improves helpfulness. It traced the split to architecture: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies how upper layers *express* behavior (see "Do pretraining and fine-tuning scale independently in language models?"). To the extent fine-tuning is reshaping the expression layer, it can only rearrange what the base model already knows. That is why imitation training captures a teacher's confident style without closing any real capability gap; the ceiling is set by base model fundamentals, not the fine-tuning method (see "Can imitating ChatGPT fool evaluators into thinking models improved?").
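
The decoupling can be made concrete with a small sketch of how emulated fine-tuning is commonly described: take next-token log-probabilities from a large base model and add the behavior delta (tuned minus base) measured on a small model pair. The note itself does not spell out the formula, so treat the combination rule, the function names, and the toy numbers below as assumptions for illustration only.

```python
import math

def log_softmax(scores):
    """Normalize a list of unnormalized log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def emulate_fine_tuning(large_base_logp, small_base_logp, small_tuned_logp):
    """Add a small-model behavior delta (tuned minus base) to a large base
    model's log-probabilities, then renormalize over the vocabulary."""
    combined = [
        lb + (st - sb)
        for lb, sb, st in zip(large_base_logp, small_base_logp, small_tuned_logp)
    ]
    return log_softmax(combined)

# Toy next-token distributions over a 3-token vocabulary (made-up numbers).
large_base = log_softmax([2.0, 1.0, 0.0])   # stands in for "knowledge"
small_base = log_softmax([1.0, 1.0, 1.0])
small_tuned = log_softmax([0.0, 2.0, 1.0])  # stands in for "behavior"

emulated = emulate_fine_tuning(large_base, small_base, small_tuned)
print([round(math.exp(lp), 3) for lp in emulated])
```

If the small tuned model is identical to the small base model, the delta is zero and the result is just the large base distribution, which is one way to read the claim that fine-tuning shifts expression without adding knowledge.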

That lens explains the recurring pattern where the *score* goes up but the *substance* goes down. Supervised fine-tuning raises final-answer accuracy while reducing reasoning informativeness by 38.9% on average: models reach correct answers through pattern-matching shortcuts rather than genuine inference (see "Does supervised fine-tuning actually improve reasoning quality?"). On optimization problems the same thing shows up as outputs that *look* right (valid JSON, proper sections) without being physically feasible; the model learned the surface features of a solution, not how to construct one (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). Fine-tuning can also quietly sever the link between a model's reasoning and its answer: chains of thought become more performative, and interventions on the chain such as early termination, paraphrasing, and filler substitution more often leave the final answer unchanged (see "Does fine-tuning disconnect reasoning steps from final answers?"). The capability being optimized (give the right-looking answer) crowds out a capability you weren't measuring (reason your way there).
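
One of the faithfulness tests named above, early termination, can be sketched as follows: cut the chain of thought at each step and check whether the final answer still matches the answer given with the full chain. This is a minimal sketch, not the cited study's protocol; `answer_fn` and the two toy models are hypothetical stand-ins for a real model call.

```python
from typing import Callable, List

def early_termination_invariance(
    steps: List[str],
    answer_fn: Callable[[List[str]], str],
) -> float:
    """Fraction of truncation points at which the answer is unchanged
    from the answer given with the full chain of thought."""
    if not steps:
        raise ValueError("need at least one reasoning step")
    full_answer = answer_fn(steps)
    unchanged = sum(
        1 for cut in range(len(steps)) if answer_fn(steps[:cut]) == full_answer
    )
    return unchanged / len(steps)

# Toy stand-ins for a model (hypothetical, for illustration only).
def functional_model(steps: List[str]) -> str:
    # The answer depends on how much reasoning was actually completed.
    return str(len(steps))

def performative_model(steps: List[str]) -> str:
    # The answer ignores the reasoning entirely.
    return "42"

chain = ["parse the problem", "set up the equation", "solve for x"]
print(early_termination_invariance(chain, functional_model))    # 0.0
print(early_termination_invariance(chain, performative_model))  # 1.0
```

A score near 1.0 means the answer rarely depends on the reasoning that precedes it, which is the "performative" pattern described above.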

Reinforcement-style tuning shows a parallel failure under a different name. RL post-training tends to amplify a single dominant format inherited from pretraining within the first epoch while collapsing the alternatives, and the winner depends on model scale, not necessarily on which format performs best (see "Does RL training collapse format diversity in pretrained models?"). Push on out-of-distribution variants and you see what was really learned: even GRPO-trained models drop sharply on near-neighbor variants of in-distribution problems, which suggests they sharpened template-matching rather than installing a general procedure (see "Do fine-tuned language models actually learn optimization procedures?"). The effect doesn't always run in the same direction: preference tuning *reduces* lexical diversity in code (where convergence on a correct answer is rewarded) but *increases* it in creative writing (where distinctiveness is rewarded), so 'improve vs. degrade' depends on what the objective incentivizes in that domain (see "Does preference tuning always reduce diversity the same way?").
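
To make "lexical diversity" measurable, one common proxy is distinct-n: the share of unique n-grams among all n-grams in a set of samples. The notes do not say which metric the cited work used, so this is an assumed stand-in, and the sample strings are invented.

```python
from typing import List

def distinct_n(samples: List[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across a set of samples.
    Higher means more lexical diversity; 0.0 if no n-grams exist."""
    ngrams = []
    for text in samples:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

# Invented samples: varied outputs before tuning, converged outputs after.
before = ["return a + b", "return sum of a and b", "a plus b is returned"]
after = ["return a + b", "return a + b", "return a + b"]

print(distinct_n(before), distinct_n(after))
```

Comparing the score on samples drawn before and after tuning, separately per domain, is one way to check whether diversity moved in opposite directions for code and creative writing.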

There's also a purely mechanical source of the tradeoff: tasks fight over the same weights. Work on multi-task tuning shows that tasks trained together interfere, and that the fix is to isolate the core parameter regions each task depends on, freezing those while merging the rest, which beats standard joint fine-tuning (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). The same instinct drives a different architecture: SoftCoT keeps the main model frozen and trains a small auxiliary model to generate soft thoughts, sidestepping catastrophic forgetting by never overwriting the pre-trained weights (see "Can continuous reasoning avoid forgetting in instruction-tuned models?").
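
The isolate-then-merge idea can be illustrated on flat weight lists. This is a rough sketch under stated assumptions, not the cited method: "core" is approximated here by the k largest-magnitude updates per task, plain averaging stands in for the geometric merge, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Set

def core_indices(delta: List[float], k: int) -> Set[int]:
    """Indices of the k largest-magnitude updates: a crude proxy for the
    core region a task depends on (an assumption, not the paper's criterion)."""
    ranked = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    return set(ranked[:k])

def isolate_and_merge(
    base: List[float], deltas: Dict[str, List[float]], k: int
) -> List[float]:
    """Keep a task's own update where an index is core to exactly one task;
    average the updates everywhere else (averaging stands in for the merge)."""
    cores = {task: core_indices(d, k) for task, d in deltas.items()}
    merged = []
    for i, w in enumerate(base):
        owners = [t for t, idx in cores.items() if i in idx]
        if len(owners) == 1:
            update = deltas[owners[0]][i]
        else:
            update = sum(d[i] for d in deltas.values()) / len(deltas)
        merged.append(w + update)
    return merged

base = [0.0, 0.0, 0.0, 0.0]
deltas = {
    "math": [1.0, 0.1, -0.1, 0.0],  # index 0 is core to the math task
    "code": [-0.2, 0.1, 0.9, 0.0],  # index 2 is core to the code task
}
print(isolate_and_merge(base, deltas, k=1))
```

In this toy case naive averaging would shrink each task's strongest update to roughly 0.4, while isolation preserves 1.0 and 0.9; that is the interference the paragraph above describes.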

The thread connecting all of this, and the thing you might not have expected to find, is that 'improve some, degrade others' is rarely an accident; it's the signature of optimizing a *measurable proxy* (accuracy, helpfulness, preferred format) that diverges from the *unmeasured capability* underneath (faithful reasoning, factuality, generalization). It's the same generation-verification gap that makes pure self-improvement stall without an external anchor (see "Can models reliably improve themselves without external feedback?"). This suggests the real question isn't whether fine-tuning helps or hurts, but whether your evaluation can see the capability you're quietly trading away.

Sources 11 notes

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Does supervised fine-tuning actually improve reasoning quality?

SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (4.26 match)
Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (4.19 match)
On the Impact of Fine-Tuning on Chain-of-Thought Reasoning (3.35 match)
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning (2.44 match)
A Survey on Post-training of Large Language Models (2.40 match)
Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance (1.69 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.67 match)
Learning to Reason for Factuality (1.66 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about fine-tuning tradeoffs in LLMs. The question remains open: *why does fine-tuning improve some capabilities while degrading others?*

What a curated library found — and when (2023–2025, dated claims, not current truth):
• Fine-tuning primarily edits *behavior* (upper-layer expression) rather than *knowledge* (lower-layer storage); scaling pretraining improves factual knowledge, scaling fine-tuning improves helpfulness (~2023–2024).
• Supervised fine-tuning raises final-answer accuracy while degrading reasoning informativeness by ~39%, via pattern-matching shortcuts rather than genuine inference (~2024).
• RL post-training converges on a single dominant format from pretraining within the first epoch, collapsing alternatives; OOD tests reveal memorization rather than generalization (~2025).
• Preference tuning's diversity effects are domain-dependent: reduces lexical diversity in code, increases it in creative writing (~2024–2025).
• Multi-task fine-tuning interference is solvable by isolating core parameter regions per task; auxiliary modules (SoftCoT) sidestep catastrophic forgetting (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.15717 (2023-05): The False Promise of Imitating Proprietary LLMs
• arXiv:2310.12962 (2023-10): An Emulator for Fine-Tuning Large Language Models using Small Language Models
• arXiv:2507.08967 (2025-07): Self-Improving Model Steering
• arXiv:2508.21741 (2025-08): Not All Parameters Are Created Equal: Smart Isolation Boosts Fine-Tuning Performance

Your task:
(1) RE-TEST EACH CONSTRAINT. For the behavior/knowledge decoupling, upper-layer editing model, and memorization-vs-generalization claim: has newer model scaling, architectural changes (e.g., MoE, latent reasoning tokens), or improved fine-tuning methods (DPO variants, multi-objective loss weighting, intervention-based steering) since ~mid-2025 relaxed or overturned these limits? Separate what remains durable (the measurement-capability gap) from what may have shifted (whether fine-tuning can now edit knowledge, not just behavior).
(2) Surface contradicting or superseding work from the last ~6 months. Has any recent paper shown fine-tuning that *doesn't* degrade unmeasured capabilities, or shown the behavior/knowledge split is false?
(3) Propose 2 research questions that assume the regime may have moved: e.g., can auxiliary parameter-steering modules now edit knowledge layers without overwriting base performance? Can multi-objective fine-tuning with explicit anti-degradation loss preserve reasoning while improving accuracy?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
