Does fine-tuning actually change model capabilities or only output distribution?
This explores whether fine-tuning genuinely adds new abilities to a model, or whether it mostly reshapes how the model presents what its pretraining already contains.
The corpus leans hard toward the second answer, with some sharp caveats. The most direct evidence: when models are instruction-tuned on semantically empty or even deliberately wrong instructions, they perform about as well as models trained on correct ones (the source note reports 43% against a 42.6% random baseline). What transfers isn't task understanding; it's familiarity with the output space (see "Does instruction tuning teach task understanding or output format?"). The same pattern shows up in optimization problems, where supervised fine-tuning makes answers *look* right (valid JSON, proper sections, expected identifiers) without making them physically feasible. The model learns the surface of a correct solution, not how to build one (see "Does supervised fine-tuning actually improve reasoning on optimization problems?").
The reinforcement-learning side tells a strikingly similar story from a different angle. One line of work argues that base models already contain reasoning ability in latent form, and that RL post-training merely optimizes *when* to deploy it rather than teaching *how*: hybrid models recover 91% of the gains by routing tokens alone, and the activation patterns for reasoning strategies exist before any RL touches the model (see "Does RL post-training create reasoning or just deploy it?"). Out-of-distribution tests reinforce this: RL-tuned models drop sharply on N-1 variants of problems they handle in-distribution, suggesting RL sharpens template-matching and memorization rather than installing genuine procedures (see "Do fine-tuned language models actually learn optimization procedures?"). Even format diversity collapses toward whatever the pretraining distribution already favored: RL amplifies one existing format and suppresses the rest within the first epoch (see "Does RL training collapse format diversity in pretrained models?").
What's quietly unsettling is that this output-level shaping can actively *unhook* reasoning from answers. Three faithfulness tests show that fine-tuned models generate reasoning chains that influence their final answers less reliably: you can truncate, paraphrase, or insert filler into the reasoning and the answer often stays the same. Fine-tuning makes the reasoning more performative, not more functional (see "Does fine-tuning disconnect reasoning steps from final answers?"). So fine-tuning doesn't just leave capability untouched; it can decorate the output with the appearance of capability the model isn't actually using.
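As a rough illustration of how perturbation tests of this kind can be scored, the sketch below measures how often an answer survives truncation or filler substitution of the reasoning chain. The `answer_fn` interface, the two toy models, and the example data are hypothetical stand-ins and are not taken from the cited work; the paraphrasing test is left out because it would need a separate paraphraser.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: (question, reasoning) -> final answer string.
AnswerFn = Callable[[str, str], str]


def truncate(reasoning: str, keep_fraction: float = 0.5) -> str:
    """Early termination: keep only the first part of the reasoning steps."""
    steps = reasoning.split("\n")
    keep = max(1, int(len(steps) * keep_fraction))
    return "\n".join(steps[:keep])


def fill(reasoning: str, filler: str = "...") -> str:
    """Filler substitution: replace every step with a content-free token."""
    return "\n".join(filler for _ in reasoning.split("\n"))


def invariance_rate(
    answer_fn: AnswerFn,
    examples: List[Tuple[str, str]],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of examples whose answer is unchanged after perturbation."""
    unchanged = 0
    for question, reasoning in examples:
        original = answer_fn(question, reasoning)
        perturbed = answer_fn(question, perturb(reasoning))
        if original == perturbed:
            unchanged += 1
    return unchanged / len(examples)


# Toy stand-ins for real models (illustrative assumptions only).
def uses_reasoning(question: str, reasoning: str) -> str:
    # Reads its answer off the last line of the reasoning chain.
    return reasoning.split("\n")[-1]


def ignores_reasoning(question: str, reasoning: str) -> str:
    # Answer depends only on the question; the reasoning is decoration.
    return {"2+3?": "5", "4*2?": "8"}[question]


EXAMPLES = [
    ("2+3?", "2 plus 3\nequals 5\n5"),
    ("4*2?", "4 times 2\nequals 8\n8"),
]

if __name__ == "__main__":
    for name, model in [("uses", uses_reasoning), ("ignores", ignores_reasoning)]:
        for perturb in (truncate, fill):
            rate = invariance_rate(model, EXAMPLES, perturb)
            print(name, perturb.__name__, rate)
```

A high invariance rate means the answer rarely depends on the reasoning text, which is the "performative" pattern described above. In this toy setup the model that ignores its reasoning is fully invariant and the one that reads from it is not.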
But the corpus refuses to let "only output distribution" be the whole story. Fine-tuning can genuinely degrade capability: training on near-impossible problems teaches degenerate shortcuts (answer repetition, skipped computation) that then contaminate abilities the model previously had (see "Do overly hard RLVR samples actually harm model capabilities?"). And there's a structural decoupling worth knowing: scaling pretraining improves factual knowledge while scaling fine-tuning improves helpfulness, and the split has architectural roots, with pretraining enriching lower-layer knowledge storage and fine-tuning modifying upper-layer behavior expression (see "Do pretraining and fine-tuning scale independently in language models?"). That maps neatly onto the discovery elsewhere that fine-tuning operates on a different part of the model than capability storage does, which is exactly why approaches like freezing the backbone and delegating reasoning to a small auxiliary model can preserve pretrained capability while still changing behavior (see "Can continuous reasoning avoid forgetting in instruction-tuned models?").
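The source note on overly hard samples attributes the damage to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A minimal sketch of that arithmetic follows, assuming binary rewards and the common form advantage = (reward - group mean) / (group std + eps) with population standard deviation; real implementations differ in details, and the function names here are illustrative.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def lone_success_advantage(group_size: int) -> float:
    """Advantage given to a single success among otherwise failed samples."""
    rewards = [1.0] + [0.0] * (group_size - 1)
    return group_relative_advantages(rewards)[0]


if __name__ == "__main__":
    # Balanced group: half the samples succeed.
    print(group_relative_advantages([1.0, 1.0, 0.0, 0.0]))
    # Near-impossible problem: one lucky success per group.
    for n in (8, 64):
        print(n, lone_success_advantage(n))
```

Under these assumptions a lone success among n samples gets an advantage of about sqrt(n - 1), roughly 2.65 for a group of 8, against 1.0 for each success in a half-solved group. The rarer the success, the harder the update pushes toward whatever that trajectory did, including a shortcut that happened to land on the right answer.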
The synthesis a curious reader walks away with: fine-tuning mostly changes the output distribution (which format, which latent ability gets surfaced, how helpful the phrasing is) rather than expanding the underlying capability frontier. Its real power is selection and expression, not creation. The less obvious point is that this same machinery cuts both ways: fine-tuning reaches deep enough to *erode* genuine capability through bad reward shaping, even as it is too shallow to *add* capability the base model never had.
Sources (9 notes)
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, reaching 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL training.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
SoftCoT avoids catastrophic forgetting by keeping the main LLM frozen while delegating soft thought generation to a small auxiliary model. This architectural separation maintains pre-trained knowledge while enabling continuous reasoning.