Why does curriculum learning with tight budgets beat fixed-budget approaches?
This explores why training that starts with a generous token budget and gradually tightens it beats training with a single fixed budget — and what that says about how models actually learn.
The cleanest answer in the corpus is that the two phases do genuinely different jobs. A generous early budget lets the model *explore*: it discovers which reasoning strategies work before being asked to be efficient. Only then does the tightening phase *compress* those strategies, distilling them under constraint. Collapse exploration and compression into one fixed budget and you ask the model to be efficient at something it hasn't learned to do yet (Does gradually tightening token budgets beat fixed budget training?).
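To make the two-phase idea concrete, here is a minimal sketch of a loose-then-tight budget schedule next to a fixed one. Everything specific in it is an assumption for illustration: the function names, the budget values (4096 and 512 tokens), the length of the exploration phase, and the linear decay shape do not come from the source notes.

```python
def token_budget(step: int, total_steps: int,
                 start_budget: int = 4096, end_budget: int = 512,
                 explore_steps: int = 300) -> int:
    """Hypothetical curriculum: hold a generous budget, then tighten linearly.

    All defaults are illustrative assumptions, not values from the source.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < explore_steps:
        # Exploration phase: the budget stays generous.
        return start_budget
    # Compression phase: interpolate linearly from start_budget to end_budget.
    remaining = max(total_steps - explore_steps, 1)
    progress = min((step - explore_steps) / remaining, 1.0)
    return round(start_budget + progress * (end_budget - start_budget))


def fixed_budget(step: int, budget: int = 512) -> int:
    """Baseline: the same tight budget at every step."""
    return budget


if __name__ == "__main__":
    for step in (0, 300, 650, 1000):
        print(step, token_budget(step, 1000), fixed_budget(step))
```

The curriculum and the fixed baseline end at the same budget; the only difference is that the curriculum spends its early steps unconstrained, which is the exploration phase the paragraph above describes.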
What makes this land is a recurring lesson elsewhere in the collection: difficulty has to be metered to where the model currently is, not set as a constant. Push too hard, too early, and training actively backfires: models fed nearly-impossible problems learn degenerate shortcuts (answer repetition, skipping computation) that then contaminate skills they already had (Do overly hard RLVR samples actually harm model capabilities?). A fixed tight budget is a version of the same trap: it imposes a constraint the model can't yet satisfy honestly, so it satisfies it dishonestly. The same boundary shows up with distillation, where teacher refinements that exceed a student's current 'learning frontier' degrade it even when they're objectively better (Does teacher-refined data always improve student model performance?).
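The source note on overly hard samples attributes this backfiring to group-relative normalization: a rare accidental success inside a group of failures receives a very large advantage. The sketch below illustrates that arithmetic with made-up rewards. It assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation); the function name and the `eps` guard are illustrative choices, not taken from the source.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its own group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    if sigma == 0:
        # All rollouts scored the same: no learning signal from this group.
        return [0.0] * len(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Nearly-impossible prompt: 7 failures and 1 lucky success (made-up rewards).
hard = group_relative_advantages([0.0] * 7 + [1.0])
# Well-matched prompt: half the rollouts succeed.
matched = group_relative_advantages([1.0] * 4 + [0.0] * 4)

if __name__ == "__main__":
    print("lucky success on hard prompt:", round(hard[-1], 2))    # about 2.65
    print("success on matched prompt:  ", round(matched[0], 2))   # about 1.0
```

Under these assumptions the single lucky rollout is weighted more than two and a half times as heavily as a success on a well-matched prompt, so whatever shortcut produced it is what gets reinforced.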
The deeper pattern is that the *ordering* of difficulty is itself the teaching signal, independent of the content. Reverse-curriculum RL slides the reasoning start state backward from near-completion, manufacturing a difficulty ramp that exposes step-level failures using only cheap outcome feedback. The curriculum approaches the granularity of process supervision without expensive human step annotations (Can curriculum learning approximate expensive process supervision?). Even without any token budget, ordering few-shot examples from hard to easy by their activation sparsity (sparse first, dense last) yields considerable gains with no difficulty labels at all (Can representation sparsity order few-shot demonstrations effectively?). In both cases the schedule, not just the data, is what's being learned from.
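The reverse-curriculum idea can be sketched as a list of start states built from one reference solution. This is a simplified illustration under stated assumptions, not the method's actual implementation: the function names, the newline-joined prompt format, and the exact-match reward are all hypothetical.

```python
def reverse_curriculum_prompts(question: str, reference_steps: list[str]) -> list[str]:
    """Build start states that slide backward from near-completion.

    Stage k hands the model all but the last k reference steps as a prefix,
    so stage 1 is the easiest and the final stage is the bare question.
    """
    n = len(reference_steps)
    prompts = []
    for k in range(1, n + 1):
        prefix = reference_steps[: n - k]
        prompts.append("\n".join([question, *prefix]))
    return prompts


def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Outcome-only feedback: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0


if __name__ == "__main__":
    steps = ["Step 1: 12 * 3 = 36", "Step 2: 36 + 4 = 40", "Step 3: 40 / 8 = 5"]
    for stage, prompt in enumerate(reverse_curriculum_prompts("Compute (12*3+4)/8.", steps), 1):
        print(f"--- stage {stage} ---\n{prompt}")
```

Because each stage differs from the previous one by a single withheld step, a drop in outcome reward between adjacent stages points at the step that was just withheld, which is how outcome-only feedback can localize step-level failures.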
A less obvious point: this whole story works because the capability is usually already *latent* and just needs activating under the right pressure. A single RLVR example can lift math accuracy from 36% to 73.6%, and test accuracy keeps improving for some 1,400 steps after training accuracy reaches 100% (Can a single training example unlock mathematical reasoning?). That reframes the budget curriculum: the generous phase isn't installing new reasoning, it's surfacing reasoning the model can already reach, and the tightening phase prunes it down. It also explains the failure direction: a fixed-budget regime prunes before anything has surfaced. A similar dynamic appears in RL post-training generally, which tends to collapse onto a single dominant strategy early and suppress the alternatives it never let the model explore (Does RL training collapse format diversity in pretrained models?).
Sources (7 notes)
Models trained with progressively tightening token budgets consistently achieve higher accuracy and better token efficiency than fixed-budget baselines. The approach works by separating learning into exploration (discovering strategies with generous budgets) and compression (distilling them under constraints).
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Sparsity-Guided Curriculum In-Context Learning uses last-layer activation sparsity to order demonstrations from sparse (harder) to dense (easier), yielding considerable performance improvements. This approach requires no external difficulty labels and works across diverse in-context learning tasks.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.