Should extra compute refine one model or build many?
When a single model stops improving after multiple training epochs, is it better to keep refining that model or spend compute building a diverse population of models whose predictions aggregate better?
Compute is now growing faster than the supply of high-quality text, which forces a question the field used to dodge: once you have re-passed your fixed corpus enough times that a single model stops improving, what do you do with the remaining budget? The instinct is to keep refining that one model. q0 argues that instinct is wrong because a single model saturates within a few epochs — further passes hit diminishing returns long before the budget is spent. The alternative is to spend the surplus compute building a population of diverse models and aggregating their predictions, which reaches a lower validation loss than any single refined model.
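To make the aggregation step concrete, here is a minimal sketch in plain Python. It assumes members are combined by averaging their predicted probabilities; the note does not say how q0 aggregates, so that choice is illustrative. The toy data and the helper names `nll` and `ensemble_probs` are hypothetical, as is the optional `weights` argument. By Jensen's inequality the averaged distribution's loss is never worse than the mean of the members' losses. Whether it also beats the best single member depends on how diverse the members are, which is the empirical claim at stake.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll(probs, targets):
    """Mean negative log-likelihood; probs[i] is a distribution over classes."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)

def ensemble_probs(member_probs, weights=None):
    """Weighted average of member distributions.

    member_probs[k][i] is member k's distribution for example i.
    weights defaults to uniform (an assumption, not q0's learned prior).
    """
    k = len(member_probs)
    if weights is None:
        w = [1.0 / k] * k
    else:
        total = sum(weights)
        w = [x / total for x in weights]
    n, v = len(member_probs[0]), len(member_probs[0][0])
    return [[sum(w[j] * member_probs[j][i][c] for j in range(k)) for c in range(v)]
            for i in range(n)]

# Toy population: K weakly informative members with independent noise.
random.seed(0)
K, N, V = 4, 200, 10
targets = [random.randrange(V) for _ in range(N)]
member_probs = []
for _ in range(K):
    rows = []
    for t in targets:
        logits = [random.gauss(0.0, 1.0) for _ in range(V)]
        logits[t] += 1.5  # toy assumption: each member leans toward the target
        rows.append(softmax(logits))
    member_probs.append(rows)

member_nll = [nll(p, targets) for p in member_probs]
ens_nll = nll(ensemble_probs(member_probs), targets)
print("member losses:", [round(x, 3) for x in member_nll])
print("ensemble loss:", round(ens_nll, 3))
```

The only guarantee the sketch relies on is `ens_nll <= mean(member_nll)`. The gap between the two grows as the members' errors become less correlated, which is why the schedule used to collect members matters.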
The conceptual move is from optimization toward something closer to Bayesian model averaging: q0 grounds the design in Solomonoff induction's idea that you should weight many hypotheses rather than commit to one. It reduces this to three primitives: a cyclic schedule that anti-correlates learning rate and weight decay to collect diverse snapshots; chain distillation, in which each model trains against its predecessor so that quality compounds; and a learned prior that selects and weights members for any inference budget. The headline is efficiency: matching a 256-epoch ensemble with ~56 epochs, roughly 4.6 times fewer.
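The first primitive can be sketched as a schedule function. This is a minimal illustration, assuming a cosine cycle in which weight decay rises as the learning rate falls and one snapshot is saved at the end of each cycle. The note only says the two are anti-correlated, so the cosine shape, the function names `cyclic_lr_wd` and `snapshot_steps`, and all numeric ranges are assumptions rather than q0's actual recipe.

```python
import math

def cyclic_lr_wd(step, cycle_len, lr_min, lr_max, wd_min, wd_max):
    """Return (lr, wd) at `step` for a cosine cycle of `cycle_len` steps.

    Within each cycle the learning rate falls from lr_max toward lr_min,
    while weight decay moves the opposite way, from wd_min toward wd_max.
    """
    phase = (step % cycle_len) / cycle_len          # position in cycle, in [0, 1)
    high = 0.5 * (1.0 + math.cos(math.pi * phase))  # 1 at cycle start, near 0 at end
    lr = lr_min + (lr_max - lr_min) * high
    wd = wd_min + (wd_max - wd_min) * (1.0 - high)
    return lr, wd

def snapshot_steps(total_steps, cycle_len):
    """Completed-step counts at which a cycle ends and a snapshot would be saved."""
    return [s for s in range(1, total_steps + 1) if s % cycle_len == 0]

# Hypothetical ranges, chosen only to show the shape of the schedule.
for step in (0, 50, 99, 100):
    lr, wd = cyclic_lr_wd(step, cycle_len=100, lr_min=1e-5, lr_max=1e-3,
                          wd_min=0.01, wd_max=0.1)
    print(step, f"lr={lr:.2e}", f"wd={wd:.3f}")
print(snapshot_steps(300, 100))
```

Each restart jumps the learning rate back up, which is what lets successive snapshots settle in different places instead of refining the same solution.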
This connects to a recurring pattern in the vault: diversity is an objective worth optimizing, not a side effect. See Should training maximize diversity when models feed into search? for the test-time version of the argument; q0 is its pretraining-time analogue. Both say that converging on one best answer is the wrong target when downstream you will be aggregating or searching.
The honest limitation is inference cost: K snapshots means K forward passes, which the authors concede is prohibitive in many deployments. They note the ensemble can be distilled back into a single student — which means the practical payoff may ultimately route through distillation, and the "population vs. single model" framing is a training-time stance, not necessarily a deployment-time one. The diversity-vs-refinement tradeoff is real, but whether it survives the distillation step is the open question.
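The distillation step can be illustrated with a standard soft-target loss. This is a minimal sketch, assuming the student is trained on a mix of cross-entropy against the teacher's distribution (here, an ensemble average) and cross-entropy against the true label. The note does not specify q0's distillation objective, so the loss form, the mixing weight `alpha`, and the function name `distill_loss` are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_probs, target, alpha=0.5):
    """alpha * CE(teacher, student) + (1 - alpha) * CE(one-hot target, student).

    teacher_probs would be the ensemble's averaged distribution for this
    example; the student pays one forward pass at inference instead of K.
    """
    student = softmax(student_logits)
    soft = -sum(q * math.log(p) for q, p in zip(teacher_probs, student))
    hard = -math.log(student[target])
    return alpha * soft + (1.0 - alpha) * hard

# Toy check: a student that reproduces the teacher minimizes the soft term.
teacher = [0.7, 0.2, 0.1]
matched = [math.log(q) for q in teacher]   # softmax(matched) equals teacher
uniform = [0.0, 0.0, 0.0]
loss_matched = distill_loss(matched, teacher, target=0, alpha=1.0)
loss_uniform = distill_loss(uniform, teacher, target=0, alpha=1.0)
print(round(loss_matched, 4), round(loss_uniform, 4))
```

Under the same assumption, the loss would also serve for chain distillation by passing the predecessor model's distribution as `teacher_probs`. How much of the ensemble's diversity benefit a single student retains is exactly the open question.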
Related concepts in this collection 3
- Should training maximize diversity when models feed into search?
  If a model runs inside a test-time search loop that samples many rollouts and picks the best, does training for entropy and diversity unlock better solutions than training for a single sharp answer?
  convergent-with: the same diversity-over-convergence principle, applied at pretraining rather than test time
- Can we prune training data without hurting model performance?
  This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste: if we can find which examples are truly necessary, we could train better models on far less data.
  convergent-with: both attack the data-constrained regime by changing how a fixed corpus is used rather than adding data
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  grounds the diversity claim: another mechanism (critique) that counteracts the tail-narrowing q0 avoids by populating distinct trajectories
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- q0: Primitives for Hyper-Epoch Pretraining
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- The Serial Scaling Hypothesis
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models
- A Primer in Post-Training Reasoning Data: What We Know About How It Works
- Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
- Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration
Original note title
when single-model pretraining saturates, extra compute should buy a population not a deeper model — hyper-epoch training trades refinement for diversity