q0: Primitives for Hyper-Epoch Pretraining

Paper · arXiv 2606.03938 · Published June 2, 2026

Multi-epoch training is becoming standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ∼56 epochs (∼4.6× fewer), or ∼67 epochs (∼3.8× fewer) when matched to the baseline’s ensemble size, and continues to improve beyond it.
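To make the first primitive concrete, the following is a minimal sketch of a cyclic schedule in which weight decay is low when the learning rate is high and vice versa. The cosine shape, the numeric ranges, and the names `cyclic_schedule` and `is_snapshot_step` are illustrative assumptions; the abstract specifies only that the two quantities are anti-correlated within a cycle.

```python
import math


def cyclic_schedule(step, steps_per_cycle,
                    lr_max=3e-4, lr_min=3e-5,
                    wd_min=0.01, wd_max=0.1):
    """Return (learning_rate, weight_decay) for a given training step.

    Assumption: a cosine cycle with illustrative ranges. Neither the shape
    nor the values are taken from the paper.
    """
    phase = (step % steps_per_cycle) / steps_per_cycle  # position in [0, 1)
    w = 0.5 * (1.0 + math.cos(math.pi * phase))         # decays from 1 toward 0
    lr = lr_min + (lr_max - lr_min) * w                 # high at cycle start
    wd = wd_max - (wd_max - wd_min) * w                 # low at cycle start
    return lr, wd


def is_snapshot_step(step, steps_per_cycle):
    """True on the last step of a cycle, where a population member is saved."""
    return (step + 1) % steps_per_cycle == 0


if __name__ == "__main__":
    for s in (0, 25, 50, 75, 99, 100):
        lr, wd = cyclic_schedule(s, steps_per_cycle=100)
        print(f"step={s:3d} lr={lr:.2e} wd={wd:.3f} "
              f"snapshot={is_snapshot_step(s, 100)}")
```

In a real training loop these two values would be written into the optimizer's parameter groups at every step, and a checkpoint would be saved whenever `is_snapshot_step` returns true.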

Introduction. Progress in language modeling has largely come from scaling compute and data together [1, 2]. The supply of high-quality text, however, is fundamentally limited while compute continues to grow, and frontier models have already consumed a substantial fraction of available data [3, 4]. As a result, scaling is increasingly entering a data-constrained regime, where further progress depends on how additional computation is used on a fixed corpus. Multi-epoch training is a natural use of increased compute. However, repeated passes over the same data show diminishing returns as models converge, and model capability stops improving after a few epochs [5, 6]. This raises the question of how compute should be allocated given a fixed dataset and a budget of N training epochs. We approach this question from first principles. Solomonoff induction, a foundational framework for generalization, suggests that one should consider a large space of hypotheses explaining the observed data and weight them according to a complexity prior [7, 8].
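The weighting idea in the last sentence can be illustrated with a finite mixture: each hypothesis h receives a weight proportional to its prior times the likelihood it assigns to the observed data, and predictions are averaged in probability space under those weights. The sketch below is a toy version under stated assumptions. A true Solomonoff prior of the form 2^(-K(h)) is uncomputable, so the description lengths used in the example are hypothetical placeholders, and the function names are ours rather than the paper's.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def posterior_log_weights(log_priors, log_likelihoods):
    """Normalized log-weights: log w_h = log prior_h + log p(D | h) - log Z."""
    unnorm = [lp + ll for lp, ll in zip(log_priors, log_likelihoods)]
    log_z = logsumexp(unnorm)
    return [u - log_z for u in unnorm]


def mixture_log_prob(log_weights, member_log_probs):
    """log of sum_h w_h * p_h(x) for a single outcome x."""
    return logsumexp([lw + lp for lw, lp in zip(log_weights, member_log_probs)])


if __name__ == "__main__":
    # Hypothetical description lengths (in bits) for two hypotheses.
    bits = [1, 2]
    log_priors = [-k * math.log(2) for k in bits]
    # Both hypotheses are assumed to explain the data equally well.
    log_w = posterior_log_weights(log_priors, [-1.0, -1.0])
    print([round(math.exp(w), 4) for w in log_w])
    # Mixture probability of a token the members score at 0.5 and 0.25.
    print(math.exp(mixture_log_prob(log_w, [math.log(0.5), math.log(0.25)])))
```

Working in log space keeps the computation stable when likelihoods are products over many tokens.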

Discussion / Conclusion. Inference overhead. Our method assumes that the inference cost of ensembling is acceptable. It delivers its gains as an ensemble of K snapshots, so inference costs K forward passes rather than one. A smaller K recovers most of the gain, but the overhead is still prohibitive in many deployment settings. This limitation is not fundamental: the ensemble can be distilled into a single student that runs at the inference cost of one model [6, 33]. This is compatible with our recipe, though we do not pursue it here. Training overhead. Chain distillation adds a further overhead: a teacher forward pass in each cycle. Because this pass requires no backward pass, it is highly optimizable. Since the teacher is frozen within a cycle, its predictions on each example are constant across that cycle’s epochs and can be cached rather than recomputed.
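The caching argument can be sketched as follows. The cache stores the frozen teacher's predictive distribution per example, so repeated epochs within a cycle reuse it, and it is cleared when the next cycle installs a new teacher. The class `TeacherCache`, the `teacher_fn` callable, and the KL-based distillation term are assumptions made for illustration; the text above does not specify the distillation loss or how predictions are stored.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


class TeacherCache:
    """Memoizes teacher predictions for the duration of one cycle."""

    def __init__(self, teacher_fn):
        self.teacher_fn = teacher_fn  # hypothetical: maps an example to logits
        self._cache = {}
        self.calls = 0                # number of teacher forward passes so far

    def probs(self, example_id, example):
        if example_id not in self._cache:
            self._cache[example_id] = softmax(self.teacher_fn(example))
            self.calls += 1
        return self._cache[example_id]

    def reset(self, teacher_fn=None):
        """Call at a cycle boundary, optionally installing the new teacher."""
        if teacher_fn is not None:
            self.teacher_fn = teacher_fn
        self._cache.clear()


def distillation_term(cache, example_id, example, student_logits):
    """Assumed loss term: KL from cached teacher to the current student."""
    return kl_divergence(cache.probs(example_id, example), softmax(student_logits))
```

With this layout, a cycle of E epochs over n examples costs n teacher forward passes instead of E × n, at the price of storing one distribution per example. In practice one would likely store only the top-k probabilities per token to bound memory, which is a further assumption beyond what is stated here.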