Does sequencing imitation training before exploration training improve reasoning?
Can Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities that neither method achieves independently.
A clean curriculum-learning result from the SRL paper. Neither Supervised RL alone nor RLVR alone is the best training strategy for hard reasoning problems on small models. The strongest pipeline runs SRL first to establish a reasoning foundation, then RLVR to refine performance against verifiable rewards. The combination outperforms each method used alone.
The two methods are complementary. SRL teaches the model to take reasoning actions that resemble expert demonstrations. This installs the basic structure of a competent reasoning rollout, even on problems where the model would never reach the correct answer on its own. RLVR can then refine performance: once the model produces reasonable rollouts some of the time, outcome rewards become informative. They distinguish near-correct attempts from off-track ones and push the model toward the correct ones.
Without the SRL foundation, RLVR fails on hard problems because the success rate is zero. Without the RLVR refinement, SRL caps out at expert-step imitation without learning to push past the demonstrations. Each method addresses a failure mode of the other.
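The zero-success failure can be made concrete with a little arithmetic. The sketch below is illustrative only: the success rates, the rollout count of 16, and the mean-baseline advantage are assumptions chosen for the example, not details taken from the SRL paper. It shows that when every rollout fails, a reward-minus-baseline signal is identically zero, whereas even a small success rate makes a rewarded rollout likely within a batch.

```python
from statistics import mean


def prob_any_success(p: float, k: int) -> float:
    """Chance that at least one of k independent rollouts earns the outcome reward."""
    return 1.0 - (1.0 - p) ** k


def baseline_advantages(rewards: list[float]) -> list[float]:
    """Reward minus the group mean. This simple baseline is an assumption
    made for illustration, not the paper's stated objective."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical per-rollout success rates: 0.0 with no imitation phase,
# 0.05 after an imitation phase has made some rollouts reasonable.
for p in (0.0, 0.05):
    print(p, round(prob_any_success(p, k=16), 3))

# All rollouts fail: every advantage is zero, so there is nothing to reinforce.
print(baseline_advantages([0.0, 0.0, 0.0, 0.0]))

# One rollout succeeds: advantages are nonzero and favour the successful rollout.
print(baseline_advantages([0.0, 1.0, 0.0, 0.0]))
```

Under these assumptions, moving the success rate from zero to a few percent is what turns outcome rewards from uninformative into usable, which is the role the note assigns to the imitation phase.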
This is a specific instance of a broader curriculum-learning template. Different training methods cover different failure modes: imitation methods fail when the demonstrations are unreachable from the student's starting point; outcome methods fail when success is too rare. The right ordering is to use the imitation method to make outcome methods viable, building up to the regime where the harder, more capability-stretching method can produce useful signal.
For practitioners, the operational guidance is: when training small models on hard problems, do not pick between SFT/SRL and RL; sequence them. Use the imitation phase to get the model into the regime where the RL phase becomes informative, then use the RL phase to push past what imitation alone can achieve. The combined pipeline is the setting to use in production.
The deeper observation is that "method choice" is often the wrong frame, and "method sequence" frequently dominates. Curricula matter when the methods have different valid regimes.
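One way to picture the sequencing is as a phase schedule driven by a measured success rate. The sketch below is a hypothetical illustration: the note does not say how to decide when to switch, so the function name `plan_phases`, the `min_rate` threshold of 0.05, and the checkpoint numbers are all assumptions made up for the example.

```python
def plan_phases(success_rates: list[float], min_rate: float = 0.05) -> list[str]:
    """Label each checkpoint 'imitation' until the measured success rate first
    reaches min_rate, then 'rl' from that checkpoint on (never switching back).

    min_rate is a hypothetical threshold, not a value from the SRL paper.
    """
    phases = []
    switched = False
    for rate in success_rates:
        if not switched and rate >= min_rate:
            switched = True
        phases.append("rl" if switched else "imitation")
    return phases


# Made-up success rates on held-out hard problems at successive checkpoints.
measured = [0.0, 0.0, 0.02, 0.08, 0.06, 0.15]
print(plan_phases(measured))
```

The one-way switch reflects the ordering described above: imitation first to make outcome rewards informative, then RL to push past the demonstrations. A real pipeline might instead use a fixed step budget for each phase.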
Inquiring lines that use this note as a source (46)
This note is a source for these synthesized inquiries.
- Does extended exoskeleton use eventually produce meaningful skill transfer?
- Can curated demonstrations compensate for smaller or simpler training environments?
- What repair strategies work best at each level of Clark's ladder?
- Does task ordering affect multi-task reinforcement learning outcomes?
- How do developmental curriculums emerge from learning progress signals?
- Does partial trace guidance work better than curriculum learning for hard problems?
- Can diverse expert demonstrations exceed the knowledge of any single expert?
- Can RL teach when to use reasoning versus when to respond directly?
- Why does mixing reasoning traces from different teachers destabilize learning?
- Why does imitation learning create a ceiling for reasoning capability?
- Does reverse-curriculum learning approximate process supervision using only outcome signals?
- Can safety training and reasoning training be combined without losing calibration?
- Why does outcome supervision fail for long reasoning chains?
- Why does combining reasoning distillation with RLVR outperform either training stage alone?
- Why does critique training produce deeper understanding than imitation training?
- How does behavior cloning reduce complexity before RL training in rerankers?
- Why do instruction following and reasoning capability trade off in training?
- Can one training example activate mathematical reasoning in RL-trained models?
- Why does imitation learning alone plateau without outcome-based refinement?
- How does Supervised RL bridge the gap between SFT and RLVR?
- What failure modes do imitation and outcome methods each address?
- How do complete multi-turn trajectories differ from isolated task examples?
- How does a challenger's escalating difficulty function as curriculum?
- How do reward signals in RLVR interact with pretraining biases?
- Why does adversarial training force deeper reasoning than surface imitation?
- Why does medium difficulty outperform both easy and hard RLVR training samples?
- How should multi-objective post-training balance competing behavioral goals?
- Why does step-level expert alignment work when outcome-only RL fails?
- Does RL primarily teach when to use reasoning or how to reason?
- Can the exploration ceiling be raised beyond what pretraining established?
- What's the difference between RLHF, RLVR, and RLCF as training paradigms?
- How does RPT compare to learning when versus how to deploy reasoning?
- Why do six different RLVR algorithms converge on similar performance levels?
- How does prolonged RL training differ from standard RLVR approaches?
- Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?
- What does process supervision reveal about step-level reasoning versus outcome rewards?
- How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?
- Why does curriculum order matter when information theory says data order is irrelevant?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- What pretraining formats encode latent reasoning strategies that RLVR can surface?
- How does action-level decomposition differ from token-level imitation in supervision?
- Can combining SRL with RLVR outperform either method used alone?
- Why does extended reasoning training improve exploration without adding new capabilities?
- Why does the pretrained prior determine the exploration ceiling?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
- What makes exploration a verifiable and measurable training objective?
Related concepts in this collection (3)
- Can step-wise expert rewards help small models learn hard reasoning?
  When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.
  Relation: same paper, the parent method
- Can curriculum learning approximate expensive process supervision?
  Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?
  Relation: adjacent, another curriculum-style training method
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Relation: adjacent, how the imitation-then-RL sequence relates to what RL actually does
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Beyond the Trade-off: Self-Supervised Reinforcement Learning for Reasoning Models' Instruction Following
- Escaping the Verifier: Learning to Reason via Demonstrations
- RL-PLUS: Countering Capability Boundary Collapse of LLMs in Reinforcement Learning with Hybrid-policy Optimization
- Spurious Rewards: Rethinking Training Signals in RLVR
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Original note title
SRL-then-RLVR curriculum learning outperforms either method alone — imitation foundation then exploration refinement