Do pretraining and fine-tuning scale independently in language models?
Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.
Emulated Fine-Tuning (EFT) provides a principled method for sampling from a distribution that approximates the result of pretraining at one scale and fine-tuning at another. This decoupling reveals a pattern: scaling up pretraining tends to improve factuality, while scaling up fine-tuning tends to improve helpfulness.
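As a sketch of what "combining" can mean here, the sampling distribution can be written per token as the large-scale pretrained log-probability plus a behavior correction measured at the fine-tuning scale. The symbols are illustrative assumptions, not notation from this note: $N$ is the pretraining scale, $M$ the fine-tuning scale, $\pi_{\text{base}}$ a pretrained model, $\pi_{\text{ft}}$ its fine-tuned counterpart, and $Z$ a per-token normalizer.

```latex
\log \tilde{\pi}(y_t \mid x, y_{<t})
  = \log \pi_{\text{base}}^{N}(y_t \mid x, y_{<t})
  + \Big[ \log \pi_{\text{ft}}^{M}(y_t \mid x, y_{<t})
        - \log \pi_{\text{base}}^{M}(y_t \mid x, y_{<t}) \Big]
  - \log Z(x, y_{<t})
```

The bracketed term isolates what fine-tuning changed at scale $M$. When $M = N$ the base terms cancel and the expression reduces to the ordinary fine-tuned model.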
The mechanism: pretraining builds knowledge (factual storage across the parameter space), while fine-tuning shapes behavior (how that knowledge is surfaced in response to queries). These operate on different aspects of the model. The related note "Why does reasoning training help math but hurt medical tasks?" suggests that the decoupling has an architectural basis: pretraining enriches lower-layer knowledge, while fine-tuning modifies upper-layer behavior.
A special case, LM up-scaling, avoids resource-intensive fine-tuning of large pretrained models by ensembling them with small fine-tuned models, which in effect emulates the result of fine-tuning the large model. This consistently improves helpfulness and factuality across the Llama, Llama-2, and Falcon families without additional training. The practical implication: you can approximate the benefits of fine-tuning a 70B model by fine-tuning a 7B model and combining its signal with the 70B pretrained model at decoding time.
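A minimal sketch of up-scaling for a single decoding step, under stated assumptions: each model is represented only by a list of next-token logits over a shared vocabulary, and the function name `upscale_logprobs` and the toy numbers are hypothetical. In practice the logits would come from three real models sharing one tokenizer.

```python
import math


def log_softmax(logits):
    """Convert raw logits to normalized log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def upscale_logprobs(large_base_logits, small_ft_logits, small_base_logits):
    """Emulate fine-tuning the large model for one next-token distribution.

    Adds the small models' behavior delta (fine-tuned minus pretrained
    log-probabilities) to the large pretrained model's log-probabilities,
    then renormalizes over the vocabulary.
    """
    large_base = log_softmax(large_base_logits)
    small_ft = log_softmax(small_ft_logits)
    small_base = log_softmax(small_base_logits)
    combined = [
        lb + (sf - sb)
        for lb, sf, sb in zip(large_base, small_ft, small_base)
    ]
    return log_softmax(combined)


# Toy three-token vocabulary (hypothetical numbers, for illustration only).
large_base = [2.0, 0.5, -1.0]   # large pretrained model
small_base = [0.0, 0.0, 0.0]    # small pretrained model
small_ft = [0.0, 0.0, 2.0]      # small fine-tuned model favors token 2

emulated = upscale_logprobs(large_base, small_ft, small_base)
probs = [math.exp(v) for v in emulated]
```

If the small fine-tuned model equals the small pretrained model, the delta is zero and the output is just the large pretrained distribution, which is a useful sanity check on any implementation.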
EFT also enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. This is relevant to the related note "Does preference optimization damage conversational grounding in large language models?": if helpfulness and harmlessness are adjustable at test time, the fixed trade-off imposed by RLHF may be unnecessary.
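One way such test-time adjustment could work is to interpolate between two behavior deltas with a mixing weight chosen at decoding time. The sketch below is an assumption for illustration, not a method confirmed by this note: `blend_traits`, the weight `lam`, and the idea of one helpfulness-tuned and one harmlessness-tuned small model are all hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw logits to normalized log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def blend_traits(base_logits, small_base_logits,
                 helpful_ft_logits, harmless_ft_logits, lam):
    """Mix two behavior deltas on top of a pretrained model for one token step.

    lam = 1.0 applies only the helpfulness delta; lam = 0.0 applies only the
    harmlessness delta; values in between interpolate linearly in log space.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    base = log_softmax(base_logits)
    small_base = log_softmax(small_base_logits)
    helpful = log_softmax(helpful_ft_logits)
    harmless = log_softmax(harmless_ft_logits)
    combined = [
        b + lam * (hp - sb) + (1.0 - lam) * (hm - sb)
        for b, sb, hp, hm in zip(base, small_base, helpful, harmless)
    ]
    return log_softmax(combined)


# Toy three-token vocabulary (hypothetical numbers, for illustration only).
base = [1.0, 0.0, -1.0]
small_base = [0.0, 0.0, 0.0]
helpful_ft = [2.0, 0.0, 0.0]    # helpfulness tuning favors token 0
harmless_ft = [0.0, 0.0, 2.0]   # harmlessness tuning favors token 2

mostly_helpful = blend_traits(base, small_base, helpful_ft, harmless_ft, 0.9)
mostly_harmless = blend_traits(base, small_base, helpful_ft, harmless_ft, 0.1)
```

Because `lam` is an argument rather than something baked into weights, the trade-off can be changed per request without retraining any model.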
The decomposition challenges the assumption that a model's capabilities are monolithic. Factual knowledge and behavioral alignment are not only distinct; they scale differently and can be independently manipulated. This has implications for deployment: rather than training one large, fully tuned model, a pipeline of specialized components (a large pretrained model for knowledge plus a small tuned model for behavior) may be more efficient and more controllable.
Inquiring lines that use this note as a source 29
This note is a source for these synthesized inquiries.
- How do pretraining biases interact differently with prompts across model tiers?
- How much does pretraining contribute to ToM performance versus task-specific training?
- How does distributional distance from pre-training relate to model difficulty?
- What hidden costs emerge when you fine-tune models for a single domain?
- Can scaling predictions become reliable if improvements are continuous not sudden?
- What capabilities actually require massive scale versus specialized training regimes?
- Why does fine-tuning improve some capabilities while degrading others?
- Why does fine-tuning fail to remove temporal contamination from pretraining?
- What makes reasoning-specific post-training different from standard parameter scaling?
- Does fine-tuning actually change model capabilities or only output distribution?
- How do large pretrained language models scale the unified recommendation paradigm?
- Why does training order matter across different domain types?
- Does the Assistant Axis exist in pre-trained models before instruction tuning?
- How does training distribution shape what language models understand best?
- What happens to base model capabilities when you apply finetuning?
- What are the scaling law differences between vision and language learning?
- How much does pretraining quality affect the modularity of fine-tuned models?
- Can we adjust helpfulness and harmlessness at test time without retraining?
- Does fine-tuning a small model match fine-tuning a large one?
- Why should scaling laws be understood as properties of data distribution rather than training in general?
- How does pretraining determine what RL can later teach a model?
- Why does parameter-efficient tuning scaling fail to improve finetuning performance?
- Does pretraining data size matter less than base model scale for finetuning?
- Which finetuning method works best across different task and data regimes?
- How do finetuning and pretraining improvements differ in their effects on model capabilities?
- Do sample-level similarities between pretraining and downstream tasks explain the frequency effect?
- What power-law scaling patterns emerge when consistency models are trained at scale?
- How does model scale affect anticipatory behavior in structured training?
- Does finetuning facts into weights overwrite existing model capabilities?
Related concepts in this collection 5
- Why does reasoning training help math but hurt medical tasks?
  Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
  architectural basis for the decoupling: knowledge in lower layers (PT) vs behavior in upper layers (FT)
- Can decoding-time tuning preserve knowledge better than weight fine-tuning?
  Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
  similar decomposition philosophy: separate knowledge from behavioral adaptation
- Does preference optimization damage conversational grounding in large language models?
  Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
  if behavioral traits are adjustable at test time, fixed alignment trade-offs may be avoidable
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  consistent: FT surfaces existing capabilities rather than creating new ones
- Can architecture choices improve inference efficiency without sacrificing accuracy?
  Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
  shared methodology of decomposing scaling into independent dimensions: EFT decouples pretraining scale (factuality) from fine-tuning scale (helpfulness), while conditional scaling laws decouple architecture from training compute; both reveal that treating model performance as a single scalar hides independently optimizable axes
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- An Emulator for Fine-Tuning Large Language Models using Small Language Models
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts
- Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
- Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets
- Scaling Laws for Agent Harnesses via Effective Feedback Compute
- A Survey on Post-training of Large Language Models
Original note title
scaling fine-tuning improves helpfulness while scaling pretraining improves factuality — these are decoupled training-stage effects