Why do adaptive curriculum schemes outperform static difficulty filters?
This explores why curricula that adjust problem difficulty as the model learns beat fixed schemes that pre-filter problems by difficulty once — and the corpus answer turns on the fact that 'difficulty' isn't a property of the problem but of the problem-meets-the-model-right-now.
The corpus has a clean answer: a static filter assumes difficulty is a fixed label attached to a problem, but the research says difficulty is really a relationship between a problem and the model's current ability, and that relationship moves. The note "How does model ability change what samples teach?" makes the core case: a sample's teaching value comes from the interaction between its difficulty and where the model is right now, and the 'productive band' of useful problems drifts within a few training steps. So any difficulty estimate you freeze at the start is stale almost immediately; a static filter is optimizing for a model that no longer exists.
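The contrast between freezing the band and re-measuring it can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any cited note: the helper name `in_band`, the band limits 0.3 and 0.7, and the pass rates are all hypothetical values chosen for the example.

```python
def in_band(pass_rates, low=0.3, high=0.7):
    """Return the ids of problems whose pass rate lies in [low, high].

    pass_rates maps a problem id to the model's estimated success rate on it.
    The band limits are illustrative assumptions, not values from the sources.
    """
    return sorted(pid for pid, rate in pass_rates.items() if low <= rate <= high)


# Hypothetical pass rates measured before training and again some steps later.
rates_at_start = {"p1": 0.95, "p2": 0.50, "p3": 0.05}
rates_later = {"p1": 1.00, "p2": 0.90, "p3": 0.40}

# Static filter: chosen once from the initial estimates and never revisited.
static_set = in_band(rates_at_start)

# Adaptive filter: recomputed from the model's current pass rates.
adaptive_set = in_band(rates_later)

print(static_set)    # ['p2']
print(adaptive_set)  # ['p3']
```

In this toy run the static set still holds `p2`, which the model now solves 90% of the time, and it never picks up `p3`, which has moved into the band. The adaptive set tracks the move.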
Why does the band matter so much? Because the payoff curve across difficulty is shaped like an inverted U. The note "Why do medium-difficulty problems teach reasoning better than hard ones?" shows that medium-difficulty problems produce the strongest reinforcement-learning gains: they mix enough successes with enough informative failures to give a clean learning signal, while easy problems have no variance to learn from and hard ones collapse into noise. The trouble is that 'medium' is defined relative to the model: yesterday's hard problem is today's medium one. An adaptive scheme keeps re-centering on that moving sweet spot; a static filter picks one slice and watches it slide out from under the model.
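One simple way to see the shape of that curve is to look at the variance of the reward itself. The sketch below assumes a binary (0 or 1) reward and uses reward variance as a rough proxy for how much signal a problem can provide; both the binary-reward setup and the proxy are assumptions for illustration, not claims from the cited notes.

```python
def reward_variance(pass_rate):
    """Variance of a 0/1 reward when the model succeeds with probability pass_rate."""
    return pass_rate * (1.0 - pass_rate)


# Variance is zero at both extremes and largest at a 50% pass rate.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"pass rate {p:.1f} -> reward variance {reward_variance(p):.2f}")
```

The proxy is symmetric, so it explains why both ends of the curve are weak but not why the hard end is worse than the easy end; that asymmetry comes from the shortcut effects discussed next.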
The failure isn't just lost opportunity: wrongly filtered hard problems actively damage the model. The note "Do overly hard RLVR samples actually harm model capabilities?" shows that training on near-impossible problems teaches degenerate shortcuts (answer repetition, skipping computation), and these shortcuts contaminate capabilities the model already had. Because group-relative reward normalization treats a rare accidental success as a high-advantage trajectory, the model gets pushed hard toward whatever fluke produced it. A static filter that misjudges difficulty (or whose judgments go stale) keeps feeding these poisoned samples; an adaptive scheme drops a problem out of the active set the moment it becomes a dead zone.
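The size of that push is easy to work out by hand. The sketch below assumes the common form of group-relative normalization, in which each reward has the group mean subtracted and is divided by the group standard deviation; the function name, the group size of 8, and the small `eps` term are assumptions for the example rather than details taken from the cited note.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every reward in the group is identical.
    return [(r - mean) / (std + eps) for r in rewards]


# Near-impossible problem: one fluke success among 8 sampled attempts.
fluke = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# Medium problem: 4 successes among 8 sampled attempts.
medium = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

print(round(fluke[0], 2))   # 2.65
print(round(medium[0], 2))  # 1.0
```

Under these assumptions the single lucky success receives an advantage of about 2.65, compared with 1.0 for each success on the medium problem, so whatever behavior produced the fluke is reinforced far more strongly per trajectory.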
The corpus also reframes 'curriculum' more broadly than difficulty-binning, which is where it gets interesting. The note "Can curriculum learning approximate expensive process supervision?" builds a curriculum not by selecting problems but by sliding the *starting point* of a single problem backward from near-completion, manufacturing a smooth difficulty ramp out of one task and recovering step-level feedback that would otherwise need expensive human annotation. The note "Does training on messy search processes improve reasoning?" points the same direction: training on the messy search process, mistakes and backtracking included, beats training only on clean optimal trajectories, because the model learns an adaptive search strategy rather than memorizing one path. Both suggest the real win of adaptivity is exposing the model to the right *failures* at the right time, something a one-shot filter structurally cannot do.
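The sliding start point can be sketched as a generator that reveals fewer and fewer steps of a reference solution. This is a toy illustration of the idea only, not the R3 implementation: the function name `reverse_curriculum`, the prompt format, and the arithmetic example are all hypothetical.

```python
def reverse_curriculum(question, reference_steps):
    """Yield (stage, prompt) pairs whose start point slides backward.

    Stage 1 reveals all but the last reference step, so the model only has to
    finish the solution. The final stage reveals nothing, which is the full task.
    """
    n = len(reference_steps)
    for revealed in range(n - 1, -1, -1):
        prefix = reference_steps[:revealed]
        prompt = "\n".join([question, *prefix])
        yield n - revealed, prompt


question = "Q: What is (3 + 4) * 2 - 5?"
steps = ["3 + 4 = 7", "7 * 2 = 14", "14 - 5 = 9"]

stages = list(reverse_curriculum(question, steps))
for stage, prompt in stages:
    print(f"stage {stage}:\n{prompt}\n")
```

Each stage is scored only on the final answer, but because consecutive stages differ by exactly one withheld step, a drop in success between two stages points at the step where the model's reasoning breaks down.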
There's a parallel lesson outside reinforcement learning worth knowing: the same 'match the material to the current learner' principle shows up in distillation. The note "Does teacher-refined data always improve student model performance?" finds that objectively higher-quality teacher data *hurts* a student when it sits past the student's learning frontier; the student should filter refinements against its own ability, not accept whatever is best in the abstract. That's the static-vs-adaptive tension in miniature: 'good data' and 'good problem' are never absolute, always relative to who's learning and when. The through-line across the whole corpus is that difficulty is a verb, not a label, and adaptive schemes win because they keep measuring it while static filters assume it once and stop looking.
Sources (6 notes)
A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.
RLVR learning follows an inverted-U curve across difficulty: medium problems yield strongest gains because they balance success frequency with informative failures, while easy samples lack variance and hard samples amplify shortcuts.
Training on nearly impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.