INQUIRING LINE

Does pretraining data size matter less than base model scale for finetuning?

This explores whether, when you fine-tune a model, you get more mileage out of starting from a bigger base model than from one that was pretrained on more data — and the corpus has a surprisingly direct answer.


The most direct evidence says yes. Systematic experiments on models from 1B to 16B parameters find that fine-tuning follows a power-based multiplicative scaling law, and that a larger base model improves fine-tuning results more than additional pretraining data does, while adding more parameter-efficient-tuning (PET) parameters barely helps at all (How should finetuning scale with model and data size?). So the headline answer leans toward base scale mattering more for the fine-tuning payoff.
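To make the multiplicative form concrete, here is a minimal Python sketch. It assumes the law takes the shape L(X, D_f) = A / (X^alpha * D_f^beta) + E, where X is the factor being scaled (base model size or pretraining data size) and D_f is the fine-tuning data size. The function names and every coefficient value are hypothetical, chosen only so that the model-size exponent exceeds the pretraining-data exponent, as the finding above describes; they are not fitted values from any study.

```python
def reducible_loss(x, d_f, A, alpha, beta):
    """Reducible part of the assumed law: A / (x**alpha * d_f**beta)."""
    return A / (x ** alpha * d_f ** beta)


def finetune_loss(x, d_f, A, alpha, beta, E):
    """Assumed multiplicative law: reducible term plus irreducible loss E."""
    return reducible_loss(x, d_f, A, alpha, beta) + E


def relative_drop(x, d_f, factor, A, alpha, beta):
    """Fraction of reducible loss removed when x grows by `factor`."""
    before = reducible_loss(x, d_f, A, alpha, beta)
    after = reducible_loss(factor * x, d_f, A, alpha, beta)
    return 1.0 - after / before


# Hypothetical exponents for illustration only (not fitted values).
ALPHA_MODEL_SIZE = 0.20      # x = base model parameters
ALPHA_PRETRAIN_DATA = 0.05   # x = pretraining tokens

drop_model = relative_drop(1e9, 1e5, 2, A=10.0, alpha=ALPHA_MODEL_SIZE, beta=0.1)
drop_data = relative_drop(1e11, 1e5, 2, A=10.0, alpha=ALPHA_PRETRAIN_DATA, beta=0.1)
print(f"doubling model size removes {drop_model:.1%} of reducible loss")
print(f"doubling pretraining data removes {drop_data:.1%} of reducible loss")
```

Because the two factors multiply, the relative drop from scaling X is 1 - factor^(-alpha) regardless of D_f, so under this assumed form the axis with the larger exponent wins at every fine-tuning data size.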

But the more interesting story is *why*, and here the corpus pulls the two apart into different jobs. Pretraining and fine-tuning don't scale along the same axis: scaling pretraining buys factual knowledge, while scaling fine-tuning buys behavioral helpfulness. This split has a physical home in the network, with pretraining enriching knowledge in the lower layers and fine-tuning reshaping behavior in the upper ones (Do pretraining and fine-tuning scale independently in language models?). That reframes the question. Fine-tuning isn't really *adding* capability; it's activating and steering what pretraining already laid down. LIMA makes this vivid: just 1,000 carefully curated examples on a strong base model are competitive with models tuned on orders of magnitude more alignment data, because post-training surfaces existing capabilities rather than building new ones (Can careful curation replace massive alignment datasets?).

If the base model is doing the heavy lifting, then *which* base you pick matters more than how much you pretrain or fine-tune. There's a clue about the mechanism: larger models learn rare tasks not because they can represent things smaller ones can't, but because their spare capacity weakens the gradients from common tasks that would otherwise overwrite slowly accumulating rare-task features. The advantage is less interference, not more expressivity (Why do larger models learn rare tasks better?). That's a capacity story, and it suggests why a bigger base survives fine-tuning with its knowledge more intact.

This also reframes the *risk* of fine-tuning. If knowledge lives in the lower layers, direct weight updates can corrupt it. That is why decoding-time proxy-tuning preserves pretrained knowledge better, closing 88-91% of the alignment gap while leaving base weights untouched (Can decoding-time tuning preserve knowledge better than weight fine-tuning?), and why representation fine-tuning that intervenes on frozen hidden states beats weight-update methods like LoRA with far fewer parameters (Can editing hidden representations beat weight updates for finetuning?). The trend across all of these: protect what the base knows, and steer lightly.
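As a rough illustration of how decoding-time steering can leave base weights untouched, the sketch below assumes the logit-arithmetic form commonly used to describe proxy-tuning: the large base model's next-token logits are shifted by the difference between a small tuned model (the expert) and its untuned counterpart (the anti-expert). The function names and toy logits are hypothetical, and a real system would operate on model outputs rather than hand-written lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, anti_expert_logits):
    """Shift base logits by (expert - anti_expert), then normalize.

    Only the output distribution changes; no model weights are updated.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, anti_expert_logits)
    ]
    return softmax(shifted)


# Toy three-token vocabulary (hypothetical numbers).
base = [2.0, 1.0, 0.5]         # large pretrained model
expert = [0.2, 1.5, 0.1]       # small tuned model
anti_expert = [0.2, 0.5, 0.1]  # same small model before tuning

probs = proxy_tuned_probs(base, expert, anti_expert)
print([round(p, 3) for p in probs])
```

When the expert and anti-expert agree, the shift is zero and the base distribution comes back unchanged, which is the sense in which this approach steers the output without touching what the base model stores.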

A few caveats are worth carrying away. Bigger-base-is-better assumes the data you tune on is *compatible* with the model being tuned: teacher-refined data that overshoots a student model's learning frontier actively degrades it, even when it's objectively higher quality (Does teacher-refined data always improve student model performance?). Method can also beat raw scale on narrow skills, as when small models trained with DPO on a teacher's correct and incorrect examples match large models on function calling (Can small models match large models on function calling?). So the honest synthesis isn't "data size never matters." It's that for the *fine-tuning return on investment*, base model scale and data *quality* dominate, and raw pretraining data volume is the weakest of the three levers.
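To show how correct and incorrect examples enter DPO training, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes you already have summed log-probabilities of the chosen (correct) and rejected (incorrect) responses under both the policy being trained and a frozen reference model. The argument names, the toy log-probabilities, and the default beta are illustrative assumptions, not settings from the function-calling work.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Stable softplus(-z) = log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: margin is 0, loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raises the correct call and lowers the incorrect one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(neutral, improved)
```

The rejected term is what distinguishes this from plain supervised fine-tuning: the loss keeps falling only if the policy moves probability away from the incorrect output as well as toward the correct one.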


Sources (8 notes)

How should finetuning scale with model and data size?

Systematic experiments across 1B–16B models reveal finetuning follows a power-based multiplicative scaling law. Larger base models improve finetuning more than more pretraining data, while increasing PET parameters provides minimal benefit.

Do pretraining and fine-tuning scale independently in language models?

Emulated Fine-Tuning reveals that scaling pretraining improves factual knowledge while scaling fine-tuning improves behavioral helpfulness. This decoupling has architectural roots: pretraining enriches lower-layer knowledge storage, while fine-tuning modifies upper-layer behavior expression.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that a strong pretrained model fine-tuned on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Why do larger models learn rare tasks better?

Larger models succeed at rare tasks not because they can represent solutions smaller models cannot, but because abundant capacity weakens gradients on common tasks, preventing them from overwriting slowly-accumulating rare-task features. Data-mixture design may be cheaper than scaling.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can editing hidden representations beat weight updates for finetuning?

ReFT learns task-specific interventions on frozen model representations rather than updating weights, with LoReFT (low-rank linear subspace variant) dramatically outperforming LoRA across reasoning, instruction-following, and NLU benchmarks while using far fewer parameters.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about pretraining vs. base model scale in fine-tuning. The question remains: does pretraining data size matter less than base model scale for downstream fine-tuning performance?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable:
• Larger base models improve fine-tuning results more than additional pretraining data; fine-tuning follows multiplicative scaling law (1B–16B range) (~2024).
• Pretraining enriches lower-layer knowledge; fine-tuning reshapes upper-layer behavior — separate scaling axes, not joint (~2024).
• 1,000 curated alignment examples on a strong base match orders-of-magnitude larger alignment datasets, because post-training surfaces capability rather than building it (~2023–2024).
• Larger models succeed on rare tasks via reduced gradient interference, not greater expressivity (~2026).
• Decoding-time proxy-tuning and representation fine-tuning preserve pretrained knowledge better than weight-update methods such as LoRA; data quality and base scale dominate raw pretraining volume (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2402.17193 (2024-02): scaling study across data, model, method
• arXiv:2404.03592 (2024-04): ReFT representation fine-tuning
• arXiv:2605.29548 (2026-05): capacity and interference in rare-task retention
• arXiv:2501.09223 (2025-01): LLM foundations overview

Your task:
(1) RE-TEST the three constraints: Does newer work (post-2026-05) show that modest pretraining data still matters if combined with architectural advances, scaling laws, or better instruction-following methods? Does the "knowledge in lower layers" thesis hold under modern mixed-precision, adapter-based, or in-context learning regimes? Is the multiplicative scaling law robust to efficient base models (MobileLLM, sub-billion)? Separate durable questions (rare-task interference, base-scale leverage) from perishable limits (whether 1K examples truly saturate alignment on current bases).
(2) Surface work in the last ~6 months that contradicts or supersedes the "base scale > data volume" claim—e.g., discoveries that data curation or pretraining curriculum recovers the gap, or that post-training RL inverts the regime.
(3) Propose 2 research questions assuming the regime may have shifted: (a) Does the knowledge/behavior split still hold in models trained with reinforcement learning with verifiable rewards (RLVR) or multi-stage post-training? (b) Under aggressive quantization or on-device constraints, does the data–scale tradeoff re-balance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
