SYNTHESIS NOTE

Does sequencing imitation then exploration training improve reasoning?

Can combining Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities neither method achieves independently.

Synthesis note · 2026-05-18 · sourced from Training Fine Tuning

A clean curriculum-learning result from the SRL paper. Neither Supervised RL (SRL) alone nor RLVR alone is the best training strategy for hard reasoning problems on small models. The strongest pipeline runs SRL first to establish a reasoning foundation, then RLVR to refine performance against verifiable rewards. The combination outperforms either method used on its own.

The mechanism is complementary. SRL teaches the model to take reasoning actions resembling expert demonstrations. This installs the basic structure of a competent reasoning rollout, even on problems where the model would never reach the correct answer on its own. RLVR can then refine performance: given that the model now produces reasonable rollouts some of the time, outcome rewards become informative — they distinguish near-correct from off-track attempts and push the model toward the correct ones.
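The contrast between the two reward types can be made concrete with a minimal Python sketch. It assumes a string-similarity score as the step-wise imitation signal and an exact-match check as the verifiable outcome reward. The function names, the similarity measure, and the example strings are illustrative assumptions, not details taken from this note.

```python
from difflib import SequenceMatcher


def step_similarity_reward(model_step: str, expert_step: str) -> float:
    """Dense imitation-style reward in [0, 1] (assumed form).

    Scores how closely one model reasoning step resembles the expert step,
    so a partly right step still earns partial credit.
    """
    return SequenceMatcher(None, model_step, expert_step).ratio()


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Sparse verifiable reward: 1.0 only if the final answer matches exactly."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


# Hypothetical example: the model's step is close to the expert's but wrong.
expert_step = "factor x^2 - 5x + 6 as (x - 2)(x - 3)"
model_step = "factor x^2 - 5x + 6 as (x - 1)(x - 6)"

dense = step_similarity_reward(model_step, expert_step)   # strictly between 0 and 1
sparse = outcome_reward("x = 1 or x = 6", "x = 2 or x = 3")  # 0.0, no credit at all
```

The dense score gives a gradient of "closer to the expert" even when the final answer is wrong, which is the role this note assigns to the imitation phase. The sparse score only becomes useful once some rollouts actually reach the correct answer.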

Without the SRL foundation, RLVR fails on hard problems because the model's success rate is effectively zero, so outcome rewards carry no learning signal. Without the RLVR refinement, SRL caps out at expert-step imitation without learning to push past the demonstrations. Each method addresses a failure mode of the other.
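Why a zero success rate stalls outcome-based training can be shown with a short Python sketch. It assumes a group-normalized advantage (each rollout's reward minus the group mean, divided by the group standard deviation), which is one common way to turn outcome rewards into a learning signal. This note does not say which RLVR algorithm the paper uses, so treat the formula, the function names, and the group size as assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages (assumed form of the RLVR signal).

    If every rollout in the group receives the same reward, every
    advantage is zero and the policy gets no update direction.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prob_group_has_success(success_rate, group_size):
    """Chance that at least one of `group_size` independent rollouts succeeds."""
    return 1.0 - (1.0 - success_rate) ** group_size


# All rollouts fail: rewards are identical, so the signal vanishes.
no_signal = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: it gets a positive advantage, the rest negative.
some_signal = group_advantages([0.0, 0.0, 0.0, 1.0])

# Raising the per-rollout success rate makes informative groups far more likely.
before = prob_group_has_success(0.0, group_size=8)
after = prob_group_has_success(0.05, group_size=8)
```

Under these assumptions, even a small nonzero success rate changes the picture: a group is informative as soon as it contains at least one success, which matches this note's account of how the imitation phase makes the outcome phase viable.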

This is a specific instance of a broader curriculum-learning template. Different training methods have different failure-mode coverage: imitation methods fail when the demonstrations are unreachable from the student's starting point; outcome methods fail when success is too rare. The right ordering uses the imitation method to make the outcome method viable: it builds up to the regime where the harder, more capability-stretching method can produce useful signal.

For practitioners, the operational guidance is: when training small models on hard problems, do not choose between SFT/SRL and RL; sequence them. Use the imitation phase to get the model into the regime where the RL phase becomes informative, then use the RL phase to push past what imitation alone can achieve. Treat the combined pipeline as the default configuration.
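One way to operationalize the sequencing guidance is a simple phase gate, sketched here in Python. The note does not specify when to switch phases, so the switching criterion (a measured pass rate crossing a threshold), the threshold value, and all names are hypothetical.

```python
def plan_phases(pass_rates, min_pass_rate=0.05):
    """Assign a training phase to each checkpoint (hypothetical gate).

    Stay in the imitation phase ("srl") until the measured pass rate on
    hard problems first reaches `min_pass_rate`, then switch to the
    outcome-reward phase ("rlvr") and remain there.
    """
    phases = []
    switched = False
    for rate in pass_rates:
        # Once the gate opens it stays open, even if the pass rate dips later.
        switched = switched or rate >= min_pass_rate
        phases.append("rlvr" if switched else "srl")
    return phases


# Hypothetical pass rates measured at four successive checkpoints.
schedule = plan_phases([0.0, 0.01, 0.06, 0.03])
```

The one-way switch reflects the ordering argued for above: imitation first to reach the regime where outcome rewards are informative, then outcome-based RL from that point on.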

The deeper observation is that "method choice" is often the wrong frame — "method sequence" frequently dominates. Curricula matter when the methods have different valid regimes.

Inquiring lines that read this note 48

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do self-generated feedback mechanisms enable effective model learning?

How does memorization interact with learning and generalization?

Why do multi-turn conversations degrade AI intent and coherence?

What repair strategies work best at each level of Clark's ladder?

What determines success in training models on multiple tasks?

How does example difficulty affect learning efficiency in language models?

Does partial trace guidance work better than curriculum learning for hard problems?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Can diverse expert demonstrations exceed the knowledge of any single expert?

Does reinforcement learning teach reasoning or just when to reason?

Do corrupted reasoning traces serve as effective supervision signals?

Why does mixing reasoning traces from different teachers destabilize learning?

How do training data properties shape reasoning capability development?

Can self-supervised signals enable process supervision without human annotation?

Does alignment training create blind spots in detecting genuine safety threats?

Can safety training and reasoning training be combined without losing calibration?

How can process reward models supervise complex reasoning traces?

What pretraining choices and baseline capability constrain reinforcement learning gains?

What constrains reinforcement learning's ability to expand model reasoning?

How do adversarial and manipulative prompts attack reasoning models?

Why does adversarial training force deeper reasoning than surface imitation?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Why does supervised fine-tuning on diverse demonstrations expand exploration diversity compared to RL?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?

Do base models contain latent reasoning that training can unlock?

What pretraining formats encode latent reasoning strategies that RLVR can surface?

Can alternative training methods improve on supervised fine-tuning for language models?

Can we reverse the instruction-following deficit through targeted training?

Related concepts in this collection 3

Can step-wise expert rewards help small models learn hard reasoning? When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide useful learning signals? This explores whether intermediate-step alignment might overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.
same paper, the parent method
Can curriculum learning approximate expensive process supervision? Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome supervision cost?
adjacent: another curriculum-style training method
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
adjacent: how the imitation-then-RL sequence relates to what RL actually does

Original note title

SRL-then-RLVR curriculum learning outperforms either method alone — imitation foundation then exploration refinement
