Why does combining reasoning distillation with RLVR outperform either training stage alone?
This explores why a two-stage recipe — first teaching a model to reason by imitating worked traces (distillation), then sharpening it against verifiable rewards (RLVR) — beats running either stage by itself.
The cleanest answer in the corpus is that the two stages do different jobs, and each one is starved without the other. The curriculum result (Does sequencing imitation then exploration training improve reasoning?) puts it directly: the imitation phase exists to manufacture *informative* reward signal. RLVR only rewards correct final answers, so if a model almost never produces a good rollout, the reward is silent and there is nothing to reinforce. Distillation first seeds the model with reasonable reasoning trajectories, which makes the later outcome rewards actually mean something. The RL phase then has good material to sharpen rather than empty space to search.
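A minimal sketch of why the reward goes silent, under two assumptions that are not from the notes: rollouts are sampled independently with a fixed per-rollout success probability `p`, and the RL stage learns from mean-centered rewards within a group of rollouts. The function names are hypothetical and chosen only for illustration.

```python
def prob_any_correct(p: float, group_size: int) -> float:
    """Chance that at least one of `group_size` independent rollouts is correct."""
    return 1.0 - (1.0 - p) ** group_size


def centered_advantages(rewards: list[float]) -> list[float]:
    """Mean-centered rewards; all zeros when every rollout scores the same."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


if __name__ == "__main__":
    # Assumed success rates: a weak base model vs. a model seeded by imitation.
    for p in (0.001, 0.05, 0.3):
        print(f"p={p}: P(any correct in 8 rollouts) = {prob_any_correct(p, 8):.3f}")

    # A group with no correct rollout carries no learning signal at all.
    print(centered_advantages([0.0] * 8))
```

With a very low success rate, most groups contain no correct rollout, so every centered advantage is zero and the update has nothing to push on. Raising the success rate before RL starts, which is the role the text assigns to the imitation phase, makes mixed groups common.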
The reason RLVR can't do the heavy lifting alone is that, on its own, it mostly *selects* rather than *creates*. Several notes converge on this: RLVR activates strategies already latent in pretraining rather than teaching new ones (What does reward learning actually do to model reasoning?), RL post-training optimizes *when* to deploy reasoning rather than *how* to reason (Does RL post-training create reasoning or just deploy it?), and base models already carry reasoning capability that minimal training merely elicits (Do base models already contain hidden reasoning ability?). If the capability isn't already in the distribution, reward optimization has nothing to surface, which is exactly the gap a distillation stage fills by importing reasoning patterns the base model lacked.
There's a sharper, more mechanical reason too: RLVR run alone tends to *narrow* the model. Pure reward optimization prioritizes exploiting what already works over exploring, collapsing the model's problem-solving range, a failure the corpus calls capability boundary collapse (Why does RLVR training narrow a model's problem solving ability?). RL also tends to converge on a single dominant output format and suppress the alternatives within the first epoch (Does RL training collapse format diversity in pretrained models?). A distillation stage beforehand widens the base of behaviors RL then refines from, so the narrowing starts from a richer pool instead of an impoverished one.
Distillation alone has the opposite weakness: imitation teaches the *shape* of reasoning without teaching its validity. RLVR measurably tightens the local coherence of traces, with fewer logical jumps between adjacent steps, even though it doesn't guarantee a globally valid proof (Does RLVR actually improve mathematical reasoning or just coherence?). So the second stage does add something imitation can't: it grinds the borrowed reasoning habits against a signal that actually checks answers, pruning trajectories that look right but lead nowhere. That's also why the *order* matters and why naive RLVR is fragile: pointed at problems that are too hard, with no good rollouts to learn from, it amplifies degenerate shortcuts instead (Do overly hard RLVR samples actually harm model capabilities?).
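The fragility on too-hard problems can be made concrete. The source note on overly hard samples attributes it to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. The sketch below assumes a GRPO-style normalization, (reward minus group mean) divided by group standard deviation, with binary rewards and a group of eight. The function name, the `eps` term, and the group size are illustrative assumptions, not details taken from the notes.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Nearly impossible problem: one lucky success among eight rollouts.
    hard = [0.0] * 7 + [1.0]
    # Moderately hard problem: half of the rollouts succeed.
    moderate = [0.0] * 4 + [1.0] * 4

    lucky = group_normalized_advantages(hard)[-1]
    typical = group_normalized_advantages(moderate)[-1]
    print(f"advantage of the lone success: {lucky:.2f}")    # about 2.6
    print(f"advantage of a success at 50%: {typical:.2f}")  # about 1.0
```

Under these assumptions the lone success on the near-impossible problem is weighted well over twice as heavily as a success on the moderate one, regardless of whether it came from sound reasoning or a shortcut. That is consistent with the note's claim that accidental successes get reinforced.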
The thing worth carrying away: "reasoning training" isn't one process but two (installing capability and tuning its deployment), and the corpus repeatedly shows these are separable (Can genuine reasoning activation coexist with contaminated benchmarks?). The reason the combination wins isn't that more training is better; it's that distillation supplies the raw reasoning RLVR can only select from, while RLVR supplies the answer-checking pressure imitation can only mimic. Each stage is the precondition the other needs.
Sources (9 notes)
Does sequencing imitation then exploration training improve reasoning? Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
What does reward learning actually do to model reasoning? Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Does RL post-training create reasoning or just deploy it? Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Do base models already contain hidden reasoning ability? Five independent mechanisms (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Why does RLVR training narrow a model's problem solving ability? RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Does RL training collapse format diversity in pretrained models? Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Does RLVR actually improve mathematical reasoning or just coherence? RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.
Do overly hard RLVR samples actually harm model capabilities? Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Can genuine reasoning activation coexist with contaminated benchmarks? RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.