How should finetuning scale with model and data size?
What scaling laws govern finetuning performance across model size, pretraining data, and finetuning data? Understanding these relationships could guide resource allocation in real-world tuning scenarios.
The inductive biases and scaling properties of finetuning methods are far less understood than those of pretraining. The source paper fills that gap with systematic experiments across model size, pretraining-data size, finetuning-parameter size, and finetuning-data size. It compares full-model tuning (FMT) with parameter-efficient tuning (PET: prompt tuning and LoRA) in the data-limited regime, where model size dwarfs the amount of finetuning data. Three findings: (1) finetuning follows a power-based multiplicative joint scaling law between finetuning-data size and each other factor; (2) finetuning benefits more from scaling the LLM than from scaling its pretraining data, while scaling PET parameters is generally ineffective; and (3) the optimal finetuning method is highly task- and data-dependent, so there is no universal winner.
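To make "power-based multiplicative joint scaling law" concrete, here is a minimal sketch. It assumes the law has the form loss = A / (X^alpha * D_f^beta) + E, where X is one scaling factor (model size, pretraining-data size, or PET parameter count), D_f is the finetuning-data size, and E is an irreducible loss. The function name and every coefficient value are hypothetical, chosen only for illustration; they are not fitted results from the paper.

```python
def joint_loss(x: float, d_f: float, A: float, alpha: float, beta: float, E: float) -> float:
    """Assumed multiplicative joint law: A / (x**alpha * d_f**beta) + E.

    x     -- one scaling factor (e.g. model parameters)
    d_f   -- finetuning-data size (e.g. number of examples)
    E     -- irreducible loss that no amount of scaling removes
    """
    return A / (x ** alpha * d_f ** beta) + E


# Hypothetical coefficients, for illustration only.
COEFFS = dict(A=100.0, alpha=0.1, beta=0.2, E=1.0)

for model_size in (1e9, 8e9):
    before = joint_loss(model_size, 1e4, **COEFFS) - COEFFS["E"]
    after = joint_loss(model_size, 2e4, **COEFFS) - COEFFS["E"]
    # "Multiplicative" means this ratio depends only on beta, not on model_size.
    print(f"model={model_size:.0e}: doubling finetuning data scales reducible loss by {after / before:.4f}")
```

Under this assumed form, doubling the finetuning data shrinks the reducible loss by the same factor, 2^(-beta), at every model size. That separability is what distinguishes a multiplicative law from an additive one, where the two factors would contribute independent terms.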
The practical takeaway is counterintuitive: with a fixed finetuning budget, a bigger base model helps more than a base model pretrained on more data, and increasing the number of PET parameters (higher LoRA rank, longer prompts) buys little. The levers are base-model scale and finetuning-data size, not adapter size.
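One way to read this ranking is through the exponents of a power law: if the reducible loss falls as X^(-alpha), doubling X removes a fraction 1 - 2^(-alpha) of it, so a factor with a near-zero exponent barely moves the loss. The sketch below assumes that power-law form; the three exponent values are hypothetical placeholders picked to mirror the qualitative ordering described above, not numbers reported by the paper.

```python
def fraction_removed_by_doubling(exponent: float) -> float:
    """Fraction of reducible loss removed when one factor doubles, others held fixed."""
    return 1.0 - 2.0 ** (-exponent)


# Hypothetical exponents (illustration only): larger means a stronger lever.
HYPOTHETICAL_EXPONENTS = {
    "base-model size": 0.15,
    "pretraining-data size": 0.05,
    "PET parameter count": 0.005,
}

# Rank the levers by how much reducible loss a doubling would remove.
ranked = sorted(
    HYPOTHETICAL_EXPONENTS,
    key=lambda name: fraction_removed_by_doubling(HYPOTHETICAL_EXPONENTS[name]),
    reverse=True,
)

for name in ranked:
    gain = fraction_removed_by_doubling(HYPOTHETICAL_EXPONENTS[name])
    print(f"{name}: doubling removes {gain:.2%} of the reducible loss")
```

With real fitted exponents for a given task and method, the same comparison indicates where the next unit of budget is better spent.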
This connects the vault's finetuning thread. It sits beside Do pretraining and fine-tuning scale independently in language models? (which capability each stage improves) by specifying the scaling-law form and the ineffectiveness of PET-parameter scaling — relevant to choosing between Can editing hidden representations beat weight updates for finetuning? and weight-based PEFT.
Inquiring lines that use this note as a source 5
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
- Does pretraining data size matter less than base model scale for finetuning?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Do scaling laws change when weight precision becomes a design variable?
Related concepts in this collection 2
- Do pretraining and fine-tuning scale independently in language models?
  Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
  That note says which capability each stage improves; this one gives the scaling-law form.
- Can editing hidden representations beat weight updates for finetuning?
  Does intervening directly on a frozen model's representations offer a better path to parameter-efficient adaptation than current weight-based methods? This challenges the dominant PEFT paradigm by treating representations as the semantic lever instead.
  The ineffectiveness of PET-parameter scaling motivates more parameter-efficient alternatives such as ReFT.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method
- Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
- Scaling Laws for Neural Language Models
- The Art of Scaling Reinforcement Learning Compute for LLMs
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Beyond neural scaling laws: beating power law scaling via data pruning
- Scaling Laws for Agent Harnesses via Effective Feedback Compute
- Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs
Original note title
finetuning follows a multiplicative joint scaling law and benefits more from model scaling than pretraining-data scaling