SYNTHESIS NOTE

Should extra compute refine one model or build many?

When a single model stops improving after multiple training epochs, is it better to keep refining that model or spend compute building a diverse population of models whose predictions aggregate better?

Synthesis note · 2026-06-27 · sourced from Reasoning Critiques

Compute is now growing faster than the supply of high-quality text, which forces a question the field used to dodge: once repeated passes over a fixed corpus no longer improve a single model, what should the remaining budget buy? The instinct is to keep refining that one model. q0 argues the instinct is wrong: a single model saturates within a few epochs, so further passes hit diminishing returns long before the budget is spent. The alternative is to spend the surplus compute on a population of diverse models and aggregate their predictions, which q0 reports reaches a lower validation loss than any single refined model.
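A minimal sketch of why aggregation can help, assuming uniform averaging of predicted probabilities and made-up numbers (the three distributions and the label are hypothetical, not taken from q0). Because negative log-likelihood is convex in the predicted probability, Jensen's inequality guarantees that the averaged prediction is never worse than the average member. It does not guarantee beating the best member; that stronger result is the empirical claim q0 makes.

```python
import math

def nll(probs, label):
    """Negative log-likelihood of the true label under one predictive distribution."""
    return -math.log(probs[label])

def average_predictions(member_probs):
    """Uniformly average per-class probabilities across ensemble members."""
    k = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(m[c] for m in member_probs) / k for c in range(n_classes)]

# Hypothetical next-token distributions from three models over a 3-token vocabulary.
members = [
    [0.70, 0.20, 0.10],
    [0.30, 0.60, 0.10],
    [0.50, 0.10, 0.40],
]
label = 0  # index of the token that actually occurred

member_losses = [nll(p, label) for p in members]
mean_member_loss = sum(member_losses) / len(member_losses)
ensemble_loss = nll(average_predictions(members), label)

# Jensen's inequality: the ensemble loss cannot exceed the mean member loss.
print(f"mean member loss: {mean_member_loss:.4f}")
print(f"ensemble loss:    {ensemble_loss:.4f}")
```

In this toy case the strongest single member still has a lower loss than the ensemble, which is why the gain depends on the members being both diverse and comparably strong.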

The conceptual move is from optimization toward something closer to Bayesian model averaging: q0 grounds the design in Solomonoff induction's idea that you should weight many hypotheses rather than commit to one. It reduces this to three primitives: (1) a cyclic learning-rate/weight-decay schedule that anti-correlates the two to collect diverse snapshots, (2) chain distillation, in which each model trains against its predecessor so that quality compounds, and (3) a learned prior that selects and weights members for any inference budget. The headline is efficiency: matching a 256-epoch ensemble with roughly 56 epochs of training.
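The first primitive can be sketched as follows. This is an illustration under stated assumptions, not q0's actual schedule: the cosine shape, the default ranges, the direction of the pairing (learning rate falling while weight decay rises within a cycle), and the choice to snapshot at the end of each cycle are all hypothetical. The names `cyclic_schedule` and `snapshot_steps` are invented for this example.

```python
import math

def cyclic_schedule(step, cycle_len, lr_min=1e-4, lr_max=1e-3, wd_min=0.01, wd_max=0.1):
    """Return (lr, wd) for a training step under a repeating cosine cycle.

    Assumption: within each cycle lr falls from lr_max toward lr_min while
    wd rises from wd_min toward wd_max, so the two move in opposite directions.
    """
    phase = (step % cycle_len) / cycle_len          # position in the cycle, in [0, 1)
    high = 0.5 * (1.0 + math.cos(math.pi * phase))  # 1 at cycle start, near 0 at cycle end
    lr = lr_min + (lr_max - lr_min) * high
    wd = wd_min + (wd_max - wd_min) * (1.0 - high)
    return lr, wd

def snapshot_steps(total_steps, cycle_len):
    """Steps at which a snapshot would be saved: the last step of each cycle (assumption)."""
    return [s for s in range(total_steps) if (s + 1) % cycle_len == 0]

# Usage: inspect the start, middle, and end of the first cycle, then the restart.
for step in (0, 50, 99, 100):
    lr, wd = cyclic_schedule(step, cycle_len=100)
    print(f"step {step:3d}  lr={lr:.6f}  wd={wd:.4f}")
print(snapshot_steps(total_steps=300, cycle_len=100))
```

Each restart pushes the model away from the previous snapshot, which is the mechanism by which a single training run could yield members that disagree enough to be worth averaging.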

This connects to a recurring pattern in the vault: diversity is an objective worth optimizing, not a side effect. Read against the note "Should training maximize diversity when models feed into search?", q0 is the pretraining-time analogue of that test-time argument: both say that converging on one best answer is the wrong target when the downstream step aggregates or searches.

The honest limitation is inference cost: K snapshots mean K forward passes, which the authors concede is prohibitive in many deployments. They note the ensemble can be distilled back into a single student, which means the practical payoff may ultimately route through distillation, and the "population vs. single model" framing is a training-time stance, not necessarily a deployment-time one. The diversity-vs-refinement tradeoff is real, but whether the ensemble's gain survives the distillation step is the open question.
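The distillation step can be sketched as a soft-label loss, assuming the student is trained to match the weighted average of the members' distributions under a KL divergence. The member distributions, the weights (standing in for the learned prior), and the function names are hypothetical; q0's actual distillation recipe is not specified in this note.

```python
import math

def ensemble_target(member_probs, weights):
    """Weighted average of member distributions.

    Producing this target costs one forward pass per member (K passes in total).
    """
    total = sum(weights)
    n_classes = len(member_probs[0])
    return [
        sum(w * m[c] for w, m in zip(weights, member_probs)) / total
        for c in range(n_classes)
    ]

def kl_divergence(target, student, eps=1e-12):
    """KL(target || student): soft-label distillation loss for one example."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(target, student)
        if t > 0
    )

# Hypothetical member predictions and weights for a 3-token vocabulary.
members = [[0.70, 0.20, 0.10], [0.30, 0.60, 0.10]]
weights = [1.0, 1.0]
target = ensemble_target(members, weights)

student_far = [0.2, 0.2, 0.6]   # a student that disagrees with the ensemble
student_match = list(target)    # a student that reproduces the ensemble exactly

print(f"loss, mismatched student: {kl_divergence(target, student_far):.4f}")
print(f"loss, matching student:   {kl_divergence(target, student_match):.4f}")
```

A student that drives this loss to zero reproduces the ensemble's predictions in a single forward pass. In practice a student of limited capacity will not reach zero, and that gap is where the open question above sits.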

Original note title

when single-model pretraining saturates, extra compute should buy a population not a deeper model — hyper-epoch training trades refinement for diversity