How do finetuning and pretraining improvements differ in their effects on model capabilities?
This note explores how improvements at the pretraining stage versus the finetuning stage change what a model can actually do. The most useful frame the corpus offers is a division of labor: pretraining builds the *knowledge*, finetuning shapes the *behavior*. The cleanest statement of this comes from work showing that scaling pretraining improves factual accuracy while scaling finetuning improves helpfulness, and that the split has architectural roots, with pretraining enriching lower-layer knowledge storage and finetuning modifying how upper layers express it (Do pretraining and fine-tuning scale independently in language models?). So the two stages aren't doing more or less of the same thing; they're operating on different parts of the model and producing different kinds of gains.
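The decoupling described above can be made concrete with a small sketch of the emulated fine-tuning idea named in the sources: take next-token log-probabilities from a large pretrained model (the knowledge) and add the shift that finetuning produced in a small model (the behavior). The function names, the toy four-token vocabulary, and the logit values are all hypothetical, and the combination rule is one reading of the technique rather than the source's exact formulation.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    return shifted - np.log(np.exp(shifted).sum())

def emulated_finetune_logprobs(large_base_logits, small_base_logits, small_ft_logits):
    """Combine large-scale pretraining with a small-scale finetuning delta.

    knowledge:      log-probs of the large pretrained model
    behavior_delta: how finetuning shifted the small model's log-probs
    The sum is renormalized so it is a valid distribution again.
    """
    knowledge = log_softmax(large_base_logits)
    behavior_delta = log_softmax(small_ft_logits) - log_softmax(small_base_logits)
    return log_softmax(knowledge + behavior_delta)

# Toy next-token logits over a 4-token vocabulary (made-up numbers).
large_base = [2.0, 1.0, 0.5, 0.0]   # large pretrained model
small_base = [1.0, 1.0, 1.0, 1.0]   # small pretrained model
small_ft = [1.0, 1.0, 3.0, 1.0]     # small model after finetuning: prefers token 2

probs = np.exp(emulated_finetune_logprobs(large_base, small_base, small_ft))
print(probs.round(3))
```

If the finetuned and base small models agree, the delta is zero and the output is just the large base model's distribution, which is the sense in which the behavioral change is separable from the knowledge.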
The striking implication is that finetuning often *surfaces* capability rather than *creating* it. Several notes converge here from different angles. Instruction tuning, for instance, seems to teach a model the shape of the expected output rather than any new understanding of the task: models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones (Does instruction tuning teach task understanding or output format?). RL post-training tells a parallel story: base models already contain reasoning ability in latent form, and RL mostly optimizes *when* to deploy it rather than installing anything new; hybrid models recover 91% of the performance gains by routing tokens alone (Does RL post-training create reasoning or just deploy it?). Even mechanically, RL updates only a sparse 5–30% of parameters and works largely by suppressing wrong trajectories rather than amplifying correct ones (What actually changes inside a model during RL training?).
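A claim like "RL updates only a sparse fraction of parameters" is something one can check directly by diffing two checkpoints. The sketch below is a minimal way to do that, assuming parameters are available as a dict of NumPy arrays; the function name `update_fraction`, the tolerance, and the toy weights are hypothetical and not taken from the cited note.

```python
import numpy as np

def update_fraction(base_params, tuned_params, tol=0.0):
    """Fraction of scalar parameters whose value moved by more than `tol`."""
    changed = 0
    total = 0
    for name, base in base_params.items():
        delta = np.abs(np.asarray(tuned_params[name]) - np.asarray(base))
        changed += int((delta > tol).sum())
        total += delta.size
    return changed / total

# Toy example: a 10x10 weight matrix where only the first two rows were updated.
base = {"w": np.zeros((10, 10))}
tuned = {"w": base["w"].copy()}
tuned["w"][:2, :] += 0.1

frac = update_fraction(base, tuned)
print(frac)  # 20 of 100 entries changed
```

With real checkpoints a small nonzero `tol` is usually the safer choice, since numerical noise from mixed-precision training can make untouched weights differ in the last few bits.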
This is why finetuning improvements scale so differently from pretraining ones. Finetuning follows a multiplicative scaling law in which a *larger base model* helps more than additional pretraining data, and adding parameter-efficient tuning parameters helps very little: you're amplifying what pretraining already laid down, not adding fresh knowledge (How should finetuning scale with model and data size?). And because finetuning is editing behavior on top of fixed knowledge, you don't even need to touch the weights to get the effect: intervening on frozen hidden representations beats LoRA by 10–50x on parameter efficiency (Can editing hidden representations beat weight updates for finetuning?).
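"Multiplicative" here can be read as: the scaling factor (for example base model size) and the finetuning data size enter the loss as a product of power-law terms rather than a sum. The sketch below illustrates that shape only. The functional form is an assumed reading of the source note, and the coefficients `A`, `alpha`, `beta`, and `E` are made-up values, not fitted numbers from any paper.

```python
def finetune_loss(x, d_f, A=10.0, alpha=0.3, beta=0.1, E=1.0):
    """Assumed multiplicative power law: loss = A / (x^alpha * d_f^beta) + E.

    x   : scaling factor, e.g. base model size (hypothetical units)
    d_f : finetuning data size
    E   : irreducible loss floor
    """
    return A / (x ** alpha * d_f ** beta) + E

small_model, large_model = 1e9, 1.6e10
few_examples, many_examples = 1e4, 1e5

# Absolute loss reduction from 10x more finetuning data, at each model size.
gain_small = finetune_loss(small_model, few_examples) - finetune_loss(small_model, many_examples)
gain_large = finetune_loss(large_model, few_examples) - finetune_loss(large_model, many_examples)

print(f"gain from 10x finetuning data, small base: {gain_small:.5f}")
print(f"gain from 10x finetuning data, large base: {gain_large:.5f}")
```

Under this form, extra finetuning data always shrinks the reducible loss by the same ratio, so how much it buys in absolute terms depends on what the base model has already achieved. An additive law would not couple the two factors this way.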
The corpus is also blunt about finetuning's failure modes, which differ in character from pretraining's. Because finetuning reshapes expression rather than understanding, it can make reasoning *performative*: fine-tuned models produce chains of thought that less reliably drive the final answer (Does fine-tuning disconnect reasoning steps from final answers?). Push RL too hard on nearly impossible problems and it amplifies degenerate shortcuts that contaminate pre-existing capability (Do overly hard RLVR samples actually harm model capabilities?). RL fine-tuning can sharpen template-matching that collapses on out-of-distribution variants, revealing memorization rather than learned procedure (Do fine-tuned language models actually learn optimization procedures?). And RL tends to collapse the rich format diversity pretraining provided down to a single dominant style (Does RL training collapse format diversity in pretrained models?).
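The source note on overly hard samples attributes the shortcut problem to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that arithmetic, assuming the common GRPO-style rule of standardizing rewards within a sampled group, shows why: the rarer a success is within its group, the larger its normalized advantage. The function name, group size, and 0/1 rewards are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Nearly impossible prompt: 1 lucky success out of 16 samples.
rare = group_relative_advantages([1.0] + [0.0] * 15)

# Moderately hard prompt: 8 successes out of 16 samples.
balanced = group_relative_advantages([1.0] * 8 + [0.0] * 8)

print(f"advantage of the lone success:  {rare[0]:.2f}")      # roughly 3.87
print(f"advantage of a typical success: {balanced[0]:.2f}")  # roughly 1.00
```

The lone success gets about four times the weight of a success on a balanced prompt, regardless of whether it came from sound reasoning or from a lucky guess, which is consistent with the note's account of how shortcuts get reinforced.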
The takeaway a curious reader might not expect: pretraining decides the ceiling of what a model knows and can do, and finetuning is mostly a steering and selection layer on top. It is powerful for shaping helpfulness, format, and reasoning deployment, but prone to degrading the very capabilities it sits on if pushed past what the base supports. That's also why the data you finetune with has to match the model's existing frontier: refinements above a student's reach hurt rather than help (Does teacher-refined data always improve student model performance?), and even training order can preserve or destroy open-ended ability depending on how entropy is managed (Does training order reshape how models handle different task types?).
Sources (12 notes)
Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions; instruction tuning reaches 43% against 42.6% for a random baseline. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Systematic experiments across 1B–16B models reveal that finetuning follows a power-based multiplicative scaling law. Scaling the base model improves finetuning more than scaling pretraining data does, while increasing the number of parameter-efficient tuning (PET) parameters provides minimal benefit.
ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.
Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.