Does curriculum-based training keep small models perpetually at their learning edge?
This explores whether curriculum training — feeding small models problems matched to their current ability and advancing as they improve — actually keeps them learning, or whether it stalls; the corpus suggests the 'learning edge' is a real, measurable frontier, and staying on it is the whole game.
The corpus is unusually pointed on this question, because several notes independently converge on a single idea: there's a frontier of difficulty where a model actually learns, and pushing past it doesn't accelerate progress; it destroys it. The clearest statement of the principle is that teacher-refined data degrades performance once it exceeds the student's learning frontier, even when that data is objectively higher quality; the student does better by filtering refinements against its own statistical profile and keeping only what's compatible (see "Does teacher-refined data always improve student model performance?"). The 'edge' isn't a metaphor here: it's a measurable boundary, and stepping over it is harmful.
What happens when you ignore the edge and just train hard? You get degeneration, not growth. Overly hard RLVR samples push models to learn shortcuts (answer repetition, computation-skipping), and those shortcuts then contaminate capabilities the model already had, because group-relative normalization treats rare accidental successes on impossible problems as high-value trajectories to reinforce (see "Do overly hard RLVR samples actually harm model capabilities?"). So the failure mode of 'always too hard' isn't stagnation, it's active regression. This is the strongest argument *for* curriculum: difficulty has to be staged precisely because mis-staged difficulty does damage.
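The normalization effect described above can be illustrated with a small sketch. It assumes a GRPO-style advantage (each rollout's reward minus the group mean, divided by the group standard deviation). The notes only say "group-relative normalization", so the exact formula, the function name `group_relative_advantages`, the `eps` constant, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical near-impossible problem: 1 accidental success in 16 rollouts.
hard_group = [0.0] * 15 + [1.0]
# Hypothetical problem near the learning edge: half the rollouts succeed.
edge_group = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard_group)
edge_adv = group_relative_advantages(edge_group)

# The lone lucky success receives a much larger advantage than any single
# success in the balanced group, so it is reinforced far more strongly.
print(f"lucky success advantage: {hard_adv[-1]:.2f}")
print(f"edge success advantage:  {edge_adv[0]:.2f}")
```

Under these assumptions the rarer the success, the larger its normalized advantage, which is one way to read the note's claim that accidental wins on impossible problems get treated as the most valuable trajectories.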
The corpus also shows what a good curriculum signal looks like for small models specifically. Supervised RL gives step-wise rewards based on similarity to expert actions, which provides a dense learning signal even when every rollout fails (see "Can step-wise expert rewards help small models learn hard reasoning?"). The note explicitly frames this as a curriculum foundation laid *before* outcome-based refinement, bridging rigid imitation (SFT) and sparse outcome-only rewards (RLVR). The same staging logic shows up in journey learning, where training on messy trajectories (failures, backtracking, self-correction) produces more robust reasoning than training on clean shortcut solutions (see "Can models learn better by training on messy exploration paths?"). The 'edge' isn't just difficulty calibration; it's exposure to the struggle itself.
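To make the contrast between dense and sparse signals concrete, here is a minimal sketch of a step-wise reward. The notes do not say how similarity to expert actions is measured, so the use of `difflib.SequenceMatcher` as the similarity metric, the function name `stepwise_rewards`, and the toy steps are all assumptions, not the method from the note.

```python
from difflib import SequenceMatcher


def stepwise_rewards(model_steps, expert_steps):
    """Score each expert step by string similarity to the model's step.

    Stand-in metric: difflib's ratio in [0, 1]. A missing model step
    is scored against the empty string, which yields 0.0.
    """
    rewards = []
    for i, expert in enumerate(expert_steps):
        produced = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, produced, expert).ratio())
    return rewards


# Hypothetical rollout that never reaches the right answer.
expert = ["x = 3 + 4", "y = x * 2", "answer = 14"]
model = ["x = 3 + 4", "y = x + 2"]

dense = stepwise_rewards(model, expert)
outcome_only = 0.0  # an outcome-only reward sees just the failed final answer

print(dense, outcome_only)
```

Even though the rollout fails, the first step earns full credit and the second earns partial credit, so there is still a gradient to follow where an outcome-only reward would give nothing.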
But here's the part that complicates 'perpetually.' Staying at the edge requires staying *plastic*, and plasticity is fragile. Low KL drift from the base model preserves the ability to keep learning new tasks: models that stay close to their base distribution keep adapting when domains shift, while parameter-only RL stalls once the task changes (see "Does staying close to the base model preserve learning ability?"). The flip side is that ordinary RL post-training quietly collapses diversity, converging on a single dominant format within the first epoch and suppressing alternatives (see "Does RL training collapse format diversity in pretrained models?"). So a curriculum can keep handing a model harder problems while the model is silently losing the representational range it needs to learn from them. 'Perpetual learning edge' is therefore conditional, not automatic: it holds only if the training procedure also protects plasticity.
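One way to picture drift and format collapse together is to compare a policy's distribution over output formats with the base model's. The sketch below is a toy under stated assumptions: the three-format distributions are invented numbers, and measuring drift as KL(policy || base) over a categorical distribution is an illustrative choice, not the measurement used in the notes.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two categorical distributions.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical share of three output formats in the base model.
base = [0.50, 0.30, 0.20]
# Hypothetical policy that stayed close to the base.
low_drift = [0.55, 0.28, 0.17]
# Hypothetical policy that collapsed onto one dominant format.
collapsed = [0.98, 0.01, 0.01]

print(f"low-drift KL: {kl_divergence(low_drift, base):.4f}")
print(f"collapsed KL: {kl_divergence(collapsed, base):.4f}")
```

In this toy the collapsed policy sits roughly two orders of magnitude further from the base than the low-drift one, which is the kind of gap a training loop would need to monitor if it wants to protect plasticity.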
The honest answer: curriculum keeps a small model at its learning edge *if* difficulty is matched to its current frontier, if the curriculum is staged (dense expert signal first, outcome rewards later), and if drift from the base is kept low so plasticity survives. Get any of those wrong and the edge becomes a cliff. Notably, small models that respect these constraints can punch far above their size, from DPO-trained small models matching large ones on function calling (see "Can small models match large models on function calling?") to student cross-encoders exceeding their own LLM teachers once given enough well-targeted data (see "Can smaller models outperform their LLM teachers with enough data?"). The edge is where small models win, but nothing keeps them there for free.
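The first condition, matching difficulty to the current frontier, can be sketched as a simple filter over measured pass rates. The notes do not specify how the frontier is measured, so using the model's empirical pass rate as a proxy, the function name `select_frontier`, and the 0.2 to 0.8 band are all hypothetical choices for illustration.

```python
def select_frontier(pass_rates, low=0.2, high=0.8):
    """Keep problems the model solves sometimes but not always.

    pass_rates maps a problem id to the fraction of recent rollouts
    that succeeded; the band limits are illustrative, not tuned.
    """
    return [pid for pid, rate in pass_rates.items() if low <= rate <= high]


# Hypothetical pass rates measured on the current checkpoint.
pass_rates = {
    "easy": 0.95,        # already mastered, little left to learn
    "edge_a": 0.50,      # sometimes solved: on the learning edge
    "edge_b": 0.25,      # hard but reachable
    "impossible": 0.00,  # past the frontier, risks shortcut learning
}

batch = select_frontier(pass_rates)
print(batch)
```

Re-measuring pass rates after each training stage and re-running the filter is what would make this a curriculum rather than a one-off data cut: problems drift from the top of the band out of it as the model improves.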
Sources (8 notes)
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
Research shows that training on messy trajectories—failed attempts, self-correction, and backtracking—teaches more robust reasoning than training only on shortcut solutions. This approach models o1-style deep reasoning as search internalization rather than solution memorization.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.