INQUIRING LINE

Does fine-tuning actually change model capabilities or only output distribution?

This explores whether fine-tuning genuinely adds new abilities to a model, or whether it mostly reshapes how the model presents what its pretraining already contains.


The corpus leans hard toward the second answer, with some sharp caveats. The most direct evidence: models instruction-tuned on semantically empty or even deliberately wrong instructions perform comparably to models trained on correct ones, and instruction tuning's reported 43% sits barely above a random baseline's 42.6%. What transfers isn't task understanding; it's familiarity with the output space (see "Does instruction tuning teach task understanding or output format?"). The same pattern shows up in optimization problems, where supervised fine-tuning makes answers *look* right (valid JSON, proper sections, expected identifiers) without making them physically feasible. The model learns the surface of a correct solution, not how to build one (see "Does supervised fine-tuning actually improve reasoning on optimization problems?").

The reinforcement-learning side tells a strikingly similar story from a different angle. One line of work argues that base models already contain reasoning ability in latent form, and that RL post-training merely optimizes *when* to deploy it rather than teaching *how*: hybrid models recover 91% of the gains by routing tokens alone, and the activation patterns for reasoning strategies exist before any RL touches the model (see "Does RL post-training create reasoning or just deploy it?"). Out-of-distribution tests reinforce this: RL-tuned models drop sharply on N-1 variants of problems they handle in-distribution, suggesting RL sharpens template-matching and memorization rather than installing genuine procedures (see "Do fine-tuned language models actually learn optimization procedures?"). Even format diversity collapses toward a format the pretraining distribution already contained: RL amplifies one existing format and suppresses the rest within the first epoch (see "Does RL training collapse format diversity in pretrained models?").

What's quietly unsettling is that this output-level shaping can actively *unhook* reasoning from answers. Three faithfulness tests show that fine-tuned models generate reasoning chains that influence their final answers less reliably: you can truncate, paraphrase, or insert filler into the reasoning and the answer often stays the same. Fine-tuning makes the reasoning more performative, not more functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). So fine-tuning doesn't just leave capability untouched; it can decorate the output with the appearance of capability the model isn't actually using.
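The three perturbations described above (truncation, paraphrasing, filler substitution) can be sketched as a small measurement harness. This is a minimal illustration, not the procedure from the cited work: `answer_fn` and `paraphrase` are hypothetical caller-supplied functions standing in for a model call and a rewriting step, and splitting the reasoning on newlines is an assumption about how steps are delimited.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (question, reasoning) -> final answer string.
AnswerFn = Callable[[str, str], str]


def truncate(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first part of the reasoning chain."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def filler(reasoning: str, token: str = "...") -> str:
    """Replace every reasoning step with a content-free filler token."""
    return "\n".join(token for _ in reasoning.split("\n"))


def invariance_rates(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, str]],
    paraphrase: Callable[[str], str],
) -> Dict[str, float]:
    """Fraction of examples whose answer is unchanged under each perturbation.

    A higher rate means the answer depends less on the stated reasoning,
    i.e. the reasoning looks less faithful.
    """
    if not examples:
        raise ValueError("need at least one (question, reasoning) pair")
    perturbations = {"truncate": truncate, "paraphrase": paraphrase, "filler": filler}
    unchanged = {name: 0 for name in perturbations}
    for question, reasoning in examples:
        baseline = answer_fn(question, reasoning)
        for name, perturb in perturbations.items():
            if answer_fn(question, perturb(reasoning)) == baseline:
                unchanged[name] += 1
    return {name: count / len(examples) for name, count in unchanged.items()}
```

Comparing these rates for a base model and its fine-tuned counterpart on the same examples would show whether fine-tuning raised invariance. Note that paraphrasing should ideally leave a faithful model's answer unchanged, so that rate is read differently from the other two.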

But the corpus refuses to let "only output distribution" be the whole story. Fine-tuning can genuinely degrade capability: training on near-impossible problems teaches degenerate shortcuts such as answer repetition and skipped computation, which then contaminate abilities the model previously had (see "Do overly hard RLVR samples actually harm model capabilities?"). And there's a structural decoupling worth knowing: scaling pretraining improves factual knowledge while scaling fine-tuning improves helpfulness, and the split has architectural roots, with pretraining enriching lower-layer knowledge storage and fine-tuning modifying upper-layer behavior expression (see "Do pretraining and fine-tuning scale independently in language models?"). That fits the related observation that fine-tuning acts on a different part of the model than capability storage does, which is exactly why approaches like freezing the backbone and delegating reasoning to a small auxiliary model can preserve pretrained capability while still changing behavior (see "Can continuous reasoning avoid forgetting in instruction-tuned models?").
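The source note on hard RLVR samples attributes the degenerate-shortcut effect to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. The arithmetic can be sketched as follows. This is a minimal sketch assuming binary rewards and a GRPO-style advantage of (reward minus group mean) divided by group standard deviation; the use of population standard deviation and the `eps` value are assumptions, and real implementations may differ.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own rollout group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    # eps keeps the division finite when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# A moderately hard prompt: half of the 8 rollouts succeed.
moderate = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]

# A near-impossible prompt: a single (possibly accidental) success out of 8.
near_impossible = [1.0] + [0.0] * 7

moderate_adv = group_relative_advantages(moderate)
near_impossible_adv = group_relative_advantages(near_impossible)

print(f"success advantage, moderate prompt:        {moderate_adv[0]:.2f}")
print(f"success advantage, near-impossible prompt: {near_impossible_adv[0]:.2f}")
```

With a success rate p in the group, the successful rollout's advantage works out to roughly sqrt((1 - p) / p), so a lone success among 8 rollouts is weighted about 2.6 times as heavily as a success on the half-solved prompt. If that lone success came from a shortcut rather than sound reasoning, the update reinforces the shortcut strongly.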

The synthesis a curious reader walks away with: fine-tuning mostly changes the output distribution (which format, which latent ability gets surfaced, how helpful the phrasing is) rather than expanding the underlying capability frontier. Its real power is selection and expression, not creation. The thing you didn't know you wanted to know is that this same machinery cuts both ways: the part of the model that fine-tuning rewrites is influential enough to *erode* genuine capability through bad reward shaping, even as it's too shallow to *add* capability the base model never had.


Sources (9 notes)

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions achieve performance comparable to models trained on full, correct instructions, with instruction tuning scoring 43% against a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can continuous reasoning avoid forgetting in instruction-tuned models?

SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about fine-tuning's role in LLM capability. The question remains open: does fine-tuning genuinely expand model capabilities or primarily reshape output distribution?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat these as perishable benchmarks:
• Instruction-tuning on semantically empty or wrong instructions yields performance comparable to correct instructions, with instruction tuning scoring 43% against a random baseline's 42.6%, suggesting models learn output format distribution, not task understanding (2023–2024).
• RL post-training optimizes *when* to deploy latent reasoning rather than *how* — hybrid models recover 91% of gains via token routing alone, and reasoning activation patterns pre-exist RL intervention (~2025).
• RL-tuned models collapse on N-1 OOD variants, indicating RL amplifies template-matching and format convergence toward pretraining distribution within one epoch (~2025).
• Fine-tuning degraded chain-of-thought faithfulness independently of accuracy; reasoning steps become performative, not functional (~2024).
• Training on intractable problems induces degenerate shortcuts (answer repetition, skipped computation) that contaminate prior capabilities (~2026).
• Fine-tuning operates on upper-layer behavior expression while pretraining enriches lower-layer knowledge storage; auxiliary reasoning models can preserve capability while changing behavior (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — Do Models Really Learn to Follow Instructions?
• arXiv:2411.15382 (2024) — On the Impact of Fine-Tuning on Chain-of-Thought Reasoning
• arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors
• arXiv:2605.28388 (2026) — Mechanistically Interpreting Sample Difficulty in RLVR

Your task:
(1) RE-TEST EACH CONSTRAINT. For the output-distribution hypothesis and the faithfulness degradation claim, probe whether newer scaling laws, architectural innovations (e.g., mixture-of-experts, modular reasoning layers), or training paradigms (e.g., supervised RL, curriculum learning) have since *relaxed* these limitations or *created* conditions where fine-tuning does install new capability. Separate the durable claim (fine-tuning likely optimizes expression over capability) from perishable specifics (e.g., the 91% routing recovery may not generalize to larger model scales or novel reasoning tasks). Be explicit: does the constraint still hold?

(2) Surface the strongest work from the last ~6 months that *contradicts* the "output distribution only" framing — papers showing fine-tuning measurably expands reasoning, factuality, or cross-domain transfer in ways pretraining alone cannot.

(3) Propose 2 research questions that *assume* the regime may have shifted: e.g., does decoupling knowledge storage from behavior expression (via auxiliary modules) enable fine-tuning to add capability without degrading faithfulness? Can curriculum-based fine-tuning on graduated difficulty avoid the degenerate-shortcut trap?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
