SYNTHESIS NOTE

Do pretraining and fine-tuning scale independently in language models?

Can we decouple how model scale affects different training stages to independently improve factuality versus helpfulness? This matters for understanding whether these capabilities compete or can be optimized separately.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

Emulated Fine-Tuning (EFT) provides a principled method for sampling from a distribution that approximates the result of pretraining at one scale and fine-tuning at another. Decoupling the two stages in this way shows that scaling up pretraining tends to improve factuality, while scaling up fine-tuning tends to improve helpfulness.
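
The note does not spell out the distribution being sampled. As a sketch, assuming the log-space formulation usually associated with EFT (the symbols here are illustrative, not taken from this note): let N be the pretraining scale, M the fine-tuning scale, \pi_ref a pretrained-only model, and \pi a fine-tuned model. The per-token distribution is then the scale-N pretrained log-probability plus the behavioral change learned at scale M, renormalized by Z_t:

```latex
\log \tilde{\pi}^{N}_{M}(y_t \mid x, y_{<t})
  = \log \pi^{N}_{\mathrm{ref}}(y_t \mid x, y_{<t})
  + \Bigl[ \log \pi^{M}(y_t \mid x, y_{<t})
         - \log \pi^{M}_{\mathrm{ref}}(y_t \mid x, y_{<t}) \Bigr]
  - \log Z_t(x, y_{<t})
```

Under this reading, the bracketed term isolates what fine-tuning changed at scale M, so it can be added to a pretrained model of any other scale N.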

The mechanism: pretraining builds knowledge (factual storage across the parameter space), while fine-tuning shapes behavior (how that knowledge is surfaced in response to queries). These operate on different aspects of the model. The related note "Why does reasoning training help math but hurt medical tasks?" suggests that the decoupling has an architectural basis: pretraining enriches lower-layer knowledge, while fine-tuning modifies upper-layer behavior.

A special case, LM up-scaling, avoids resource-intensive fine-tuning of large pretrained models by ensembling them with small fine-tuned models, essentially emulating the result of fine-tuning the large model. This consistently improves helpfulness and factuality across the Llama, Llama-2, and Falcon families without additional training. The practical implication: you can approximate the benefits of fine-tuning a 70B model by fine-tuning a 7B model and combining its signal with the 70B pretrained model.
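
A minimal sketch of one up-scaling decode step, assuming the ensemble is formed in log-probability space as large base plus (small fine-tuned minus small base). The function names and the three-token toy vocabulary are hypothetical; a real setup would take the logits from three actual models sharing one tokenizer.

```python
import math


def log_softmax(logits):
    """Convert raw logits to normalized log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def upscale_next_token(large_base_logits, small_ft_logits, small_base_logits):
    """Emulate fine-tuning the large model for one decoding step.

    Assumed combination rule (not stated in the note):
        log p = log p_large_base + (log p_small_ft - log p_small_base)
    followed by renormalization over the vocabulary.
    """
    large_base = log_softmax(large_base_logits)
    small_ft = log_softmax(small_ft_logits)
    small_base = log_softmax(small_base_logits)
    combined = [
        lb + (ft - sb)
        for lb, ft, sb in zip(large_base, small_ft, small_base)
    ]
    return log_softmax(combined)


# Toy logits over a three-token vocabulary (made-up numbers).
large_base = [2.0, 1.0, 0.0]   # large pretrained model
small_base = [1.0, 1.0, 0.0]   # small pretrained model
small_ft = [0.0, 2.0, 0.0]     # small fine-tuned model

probs = [math.exp(v) for v in upscale_next_token(large_base, small_ft, small_base)]
print(probs)
```

If the small fine-tuned and small base models agree, the behavioral delta is zero and the output reduces to the large base distribution, which is a useful sanity check for this kind of ensemble.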

EFT also enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training. This is relevant to the related note "Does preference optimization damage conversational grounding in large language models?": if helpfulness and harmlessness are adjustable at test time, the fixed trade-off imposed by RLHF may be unnecessary.
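
The note does not say how the adjustment is performed. One way to picture it, as a sketch under the assumption that two behavioral deltas (one from a helpfulness-tuned model, one from a harmlessness-tuned model) are linearly mixed in log space with a weight `lam` chosen at test time; all names and numbers below are hypothetical:

```python
import math


def log_softmax(logits):
    """Convert raw logits to normalized log-probabilities (numerically stable)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def interpolate_traits(base_logits, helpful_logits, harmless_logits, lam):
    """Mix two fine-tuning deltas at decode time.

    Assumed rule: log p = base + lam * d_helpful + (1 - lam) * d_harmless,
    where each delta is (fine-tuned log-prob - base log-prob).
    lam = 1.0 recovers the helpful model, lam = 0.0 the harmless one.
    """
    base = log_softmax(base_logits)
    helpful = log_softmax(helpful_logits)
    harmless = log_softmax(harmless_logits)
    combined = [
        b + lam * (hp - b) + (1.0 - lam) * (hm - b)
        for b, hp, hm in zip(base, helpful, harmless)
    ]
    return log_softmax(combined)


# Toy logits over a three-token vocabulary (made-up numbers).
base = [1.0, 1.0, 1.0]
helpful = [3.0, 1.0, 0.0]
harmless = [0.0, 1.0, 3.0]

for lam in (0.0, 0.5, 1.0):
    dist = [math.exp(v) for v in interpolate_traits(base, helpful, harmless, lam)]
    print(lam, dist)
```

Because `lam` is only an argument to the decode step, the trade-off can be changed per request without retraining any of the three models.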

The decomposition challenges the assumption that a model's capabilities are monolithic. Factual knowledge and behavioral alignment are not only distinct — they scale differently and can be independently manipulated. This has implications for deployment: rather than training one large, fully-tuned model, a pipeline of specialized components (large pretrained for knowledge + small tuned for behavior) may be more efficient and more controllable.

Original note title

scaling fine-tuning improves helpfulness while scaling pretraining improves factuality — these are decoupled training-stage effects