SYNTHESIS NOTE

Do pretraining and fine-tuning scale independently in language models?

Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

Emulated Fine-Tuning (EFT) provides a principled method for sampling from a distribution that approximates the result of pretraining at one scale and fine-tuning at another. This decoupling reveals that scaling up pretraining tends to improve factuality, while scaling up fine-tuning tends to improve helpfulness.
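
The note does not state the formula, so the following is only a sketch of how such a combination is usually written at the level of next-token log-probabilities. The symbols are notation introduced here for illustration: N is the pretraining scale, M the fine-tuning scale, "ref" marks a pretrained-only model, "ft" a fine-tuned one, and Z_t is a per-step normalizer.

```latex
\log \tilde{\pi}^{N}_{M}(y_t \mid x, y_{<t})
  = \underbrace{\log \pi^{N}_{\mathrm{ref}}(y_t \mid x, y_{<t})}_{\text{knowledge at pretraining scale } N}
  + \underbrace{\log \pi^{M}_{\mathrm{ft}}(y_t \mid x, y_{<t})
              - \log \pi^{M}_{\mathrm{ref}}(y_t \mid x, y_{<t})}_{\text{behavior change at fine-tuning scale } M}
  - \log Z_t(x, y_{<t})
```

When N = M the two reference terms cancel and the expression reduces to the ordinary fine-tuned model, which is a quick sanity check on the decomposition.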

The mechanism: pretraining builds knowledge (facts stored across the parameter space), while fine-tuning shapes behavior (how that knowledge is surfaced in response to queries). The two operate on different aspects of the model. Per the related note "Why does reasoning training help math but hurt medical tasks?", the decoupling may also have an architectural basis: pretraining enriches lower-layer knowledge, while fine-tuning modifies upper-layer behavior.

A special case, LM up-scaling, avoids resource-intensive fine-tuning of large pretrained models by ensembling them with small fine-tuned models, which emulates the result of fine-tuning the large model. This consistently improves helpfulness and factuality across the Llama, Llama-2, and Falcon families without additional training. The practical implication: you can approximate the benefits of fine-tuning a 70B model by fine-tuning a 7B model and combining its signal with the 70B base model.
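
A minimal sketch of the up-scaling idea at a single decoding step, assuming all three models share a vocabulary and expose next-token logits. The function names and the toy numbers are hypothetical and are not taken from any library or from the source; a real setup would obtain the logits from actual models.

```python
import math


def log_softmax(logits):
    """Convert raw logits to normalized log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def upscaled_logprobs(large_base_logits, small_ft_logits, small_base_logits):
    """Emulate fine-tuning the large model for one next-token step.

    Takes the large base model's log-probabilities and adds the behavior
    change learned at small scale (small fine-tuned minus small base),
    then renormalizes over the vocabulary.
    """
    large_base = log_softmax(large_base_logits)
    small_ft = log_softmax(small_ft_logits)
    small_base = log_softmax(small_base_logits)
    combined = [b + (f - s) for b, f, s in zip(large_base, small_ft, small_base)]
    return log_softmax(combined)


# Toy 3-token vocabulary; the logits below are made up for illustration.
large_base = [2.0, 1.0, 0.0]   # what the large pretrained model prefers
small_base = [1.0, 1.0, 0.0]   # small pretrained model
small_ft = [1.0, 3.0, 0.0]     # small fine-tuned model: shifted toward token 1

probs = [math.exp(v) for v in upscaled_logprobs(large_base, small_ft, small_base)]
```

In this toy case the small fine-tuned model's shift toward token 1 carries over to the large model's distribution. If the small fine-tuned and small base logits are identical, the function returns the large base distribution unchanged.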

EFT also enables test-time adjustment of competing behavioral traits such as helpfulness and harmlessness without additional training. This is relevant to the related note "Does preference optimization damage conversational grounding in large language models?": if helpfulness and harmlessness are adjustable at test time, the fixed trade-off imposed by RLHF may be unnecessary.
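
One plausible way to realize such a test-time dial, sketched under assumptions the note does not confirm: two small models fine-tuned separately for helpfulness and for harmlessness, a shared small reference model, and a mixing weight `lam` chosen at inference time. All names and numbers are hypothetical.

```python
import math


def log_softmax(logits):
    """Convert raw logits to normalized log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def blended_logprobs(base_logits, ref_logits, helpful_logits, harmless_logits, lam):
    """Mix two behavior changes on top of a base model for one decoding step.

    lam = 1.0 applies only the helpfulness change, lam = 0.0 only the
    harmlessness change; values in between interpolate the two.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    base = log_softmax(base_logits)
    ref = log_softmax(ref_logits)
    helpful = log_softmax(helpful_logits)
    harmless = log_softmax(harmless_logits)
    combined = [
        b + lam * (h - r) + (1.0 - lam) * (s - r)
        for b, r, h, s in zip(base, ref, helpful, harmless)
    ]
    return log_softmax(combined)


# Toy 3-token vocabulary with made-up logits.
base = [0.0, 0.0, 0.0]
ref = [0.0, 0.0, 0.0]
helpful = [2.0, 0.0, 0.0]    # helpfulness-tuned model favors token 0
harmless = [0.0, 2.0, 0.0]   # harmlessness-tuned model favors token 1

mostly_helpful = [math.exp(v) for v in blended_logprobs(base, ref, helpful, harmless, 1.0)]
balanced = [math.exp(v) for v in blended_logprobs(base, ref, helpful, harmless, 0.5)]
mostly_harmless = [math.exp(v) for v in blended_logprobs(base, ref, helpful, harmless, 0.0)]
```

Because `lam` is only an argument to the decoding step, the trade-off can be changed per request without touching any weights, which is the property the paragraph above relies on.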

The decomposition challenges the assumption that a model's capabilities are monolithic. Factual knowledge and behavioral alignment are not only distinct; they also scale differently and can be manipulated independently. This has implications for deployment: rather than training one large, fully tuned model, a pipeline of specialized components (a large pretrained model for knowledge plus a small tuned model for behavior) may be more efficient and more controllable.

Inquiring lines that read this note 31

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can prompting inject entirely new knowledge into language models?

How do pretraining biases interact differently with prompts across model tiers?

Do base models contain latent reasoning that training can unlock?

How much does pretraining contribute to ToM performance versus task-specific training?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Do autonomous architecture discoveries follow predictable scaling laws?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why does fine-tuning improve some capabilities while degrading others?

How do training data properties shape reasoning capability development?

What makes reasoning-specific post-training different from standard parameter scaling?

How can LLM recommenders match or exceed collaborative filtering performance?

How do large pretrained language models scale the unified recommendation paradigm?

Do language model representations contain causally steerable task-specific features?

Does the Assistant Axis exist in pre-trained models before instruction tuning?

How do training priors constrain what context information can override?

How does training distribution shape what language models understand best?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Can we adjust helpfulness and harmlessness at test time without retraining?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does pretraining determine what RL can later teach a model?

Why does finetuning cause catastrophic forgetting of model capabilities?

Does finetuning facts into weights overwrite existing model capabilities?

Do language models learn genuine linguistic structure or just surface patterns?

How do parameter scaling and latent vectors interact in language models?

What are the consequences of models training on synthetic data?

Why does the same training data produce different gains across models?

Related concepts in this collection 5

Why does reasoning training help math but hurt medical tasks?
Explores whether reasoning and knowledge rely on different network mechanisms, and why training one might undermine the other across different domains.
Relation: architectural basis for the decoupling, with knowledge in lower layers (pretraining) and behavior in upper layers (fine-tuning).

Can decoding-time tuning preserve knowledge better than weight fine-tuning?
Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
Relation: similar decomposition philosophy, separating knowledge from behavioral adaptation.

Does preference optimization damage conversational grounding in large language models?
Explores whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
Relation: if behavioral traits are adjustable at test time, fixed alignment trade-offs may be avoidable.

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pretraining as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
Relation: consistent, in that fine-tuning surfaces existing capabilities rather than creating new ones.

Can architecture choices improve inference efficiency without sacrificing accuracy?
Standard scaling laws optimize training efficiency but ignore inference cost. This explores whether architectural variables like hidden size and attention configuration can unlock inference gains without trading off model accuracy under fixed training budgets.
Relation: shared methodology of decomposing scaling into independent dimensions. EFT decouples pretraining scale (factuality) from fine-tuning scale (helpfulness), while conditional scaling laws decouple architecture from training compute; both reveal that treating model performance as a single scalar hides independently optimizable axes.
