INQUIRING LINE

Why does combining reasoning distillation with RLVR outperform either training stage alone?

This explores why a two-stage recipe — first teaching a model to reason by imitating worked traces (distillation), then sharpening it against verifiable rewards (RLVR) — beats running either stage by itself.


The cleanest answer in the corpus is that the two stages do different jobs, and each one is starved without the other. The curriculum note "Does sequencing imitation then exploration training improve reasoning?" puts it directly: the imitation phase exists to manufacture *informative* reward signal. RLVR only rewards correct final answers, so if a model almost never produces a good rollout, the reward is silent: there is nothing to reinforce. Distillation seeds the model with reasonable reasoning trajectories first, which makes the later outcome rewards actually mean something. The RL phase then has good material to sharpen rather than empty space to search.
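A minimal sketch of why a sparse outcome reward goes silent, assuming binary rewards and independent rollouts with a fixed per-rollout success rate. The function name `prob_informative_group` and the success rates below are hypothetical illustrations, not figures from the source notes.

```python
def prob_informative_group(p_success: float, group_size: int) -> float:
    """Chance that at least one of `group_size` independent rollouts is correct.

    Assumes a binary outcome reward (1 for a correct final answer, else 0)
    and a fixed per-rollout success probability. Both are simplifications.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must be in [0, 1]")
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return 1.0 - (1.0 - p_success) ** group_size


if __name__ == "__main__":
    # Hypothetical success rates: a cold-start model vs. one seeded by imitation.
    for label, p in [("cold start", 0.001), ("after imitation", 0.3)]:
        chance = prob_informative_group(p, group_size=8)
        print(f"{label}: p={p}, P(any correct rollout in 8) = {chance:.4f}")
```

Under these assumptions, a model that almost never succeeds leaves nearly every batch of rollouts with no rewarded trajectory at all, while a model seeded with plausible trajectories gets a usable signal from most batches.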

The reason RLVR can't do the heavy lifting alone is that, on its own, it mostly *selects* rather than *creates*. Several notes converge on this: RLVR activates strategies already latent in pretraining rather than teaching new ones ("What does reward learning actually do to model reasoning?"), RL post-training optimizes *when* to deploy reasoning rather than *how* to reason ("Does RL post-training create reasoning or just deploy it?"), and base models already carry reasoning capability that minimal training merely elicits ("Do base models already contain hidden reasoning ability?"). If the capability isn't already in the distribution, reward optimization has nothing to surface, which is exactly the gap a distillation stage fills by importing reasoning patterns the base model lacked.

There's a sharper, more mechanical reason too: RLVR run alone tends to *narrow* the model. Pure reward optimization prioritizes exploiting what already works over exploring, collapsing the model's problem-solving range, a failure the corpus calls capability boundary collapse ("Why does RLVR training narrow a model's problem solving ability?"). RL also tends to converge on a single dominant output format and suppress the alternatives within the first epoch ("Does RL training collapse format diversity in pretrained models?"). A distillation stage beforehand widens the base of behaviors RL then refines from, so the narrowing starts from a richer pool instead of an impoverished one.

Distillation alone has the opposite weakness: imitation teaches the *shape* of reasoning without teaching its validity. RLVR measurably tightens the local coherence of traces (fewer logical jumps between adjacent steps), even though it doesn't guarantee a globally valid proof ("Does RLVR actually improve mathematical reasoning or just coherence?"). So the second stage does add something imitation can't: it grinds the borrowed reasoning habits against a signal that actually checks answers, pruning trajectories that look right but lead nowhere. That's also why the *order* matters and why naive RLVR is fragile: pointed at problems that are too hard, with no good rollouts to learn from, it amplifies degenerate shortcuts instead ("Do overly hard RLVR samples actually harm model capabilities?").
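The fragility on too-hard problems can be made concrete with a small sketch of group-relative advantage normalization, the mechanism named in the source note on overly hard samples. This is an illustration under stated assumptions: binary rewards, advantages computed as (reward minus group mean) divided by the group's population standard deviation, and a hypothetical helper name `group_relative_advantages`. Real training setups may normalize differently.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumed form: (r - mean) / (std + eps), using the population standard
    deviation. `eps` only guards against division by zero.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # One accidental success among otherwise failed rollouts.
    for group_size in (8, 64):
        rewards = [1.0] + [0.0] * (group_size - 1)
        lucky = group_relative_advantages(rewards)[0]
        print(f"group of {group_size}: advantage of the lone success = {lucky:.2f}")

    # No success at all: every advantage is zero, so nothing is reinforced.
    print(group_relative_advantages([0.0] * 8))
```

With binary rewards and a single success in a group of size G, this normalization gives the lone success an advantage of about sqrt(G - 1), so the rarer the success, the harder it is pushed, whether or not the trajectory behind it reasoned soundly. A group with no successes produces all-zero advantages, which is the silent-reward case.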

The thing worth carrying away: "reasoning training" isn't one process but two, installing capability and tuning its deployment, and the corpus repeatedly shows these are separable ("Can genuine reasoning activation coexist with contaminated benchmarks?"). The reason the combination wins isn't that more training is better; it's that distillation supplies the raw reasoning RLVR can only select from, while RLVR supplies the answer-checking pressure imitation can only mimic. Each stage is the precondition the other needs.


Sources (9 notes)

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a critical analyst of LLM reasoning training. The question: why does combining reasoning distillation with RLVR outperform either alone? A curated library (spanning 2024–2026) found these dated claims — treat them as perishable constraints to re-test, not current truth:

**What a curated library found — and when:**
• Distillation manufactures *informative* reward signal; RLVR run alone mostly selects latent strategies rather than teaching new ones (~2025).
• RLVR optimizes *when* to deploy reasoning, not *how* to reason; base models already possess latent reasoning capability (~2025).
• Pure reward optimization causes capability boundary collapse, narrowing problem-solving range and converging on a single dominant output format within one epoch (~2025).
• RLVR improves local coherence of reasoning traces without guaranteeing global validity (~2025).
• Overly hard RLVR samples induce degenerate behaviors and amplify shortcut trajectories (~2025).
• Reasoning training is separable into capability installation (distillation) and deployment tuning (RLVR) (~2025).

**Anchor papers (verify; mind their dates):**
• arXiv:2504.13837 (2025-04) — Does RL Really Incentivize Reasoning Capacity Beyond Base?
• arXiv:2504.07912 (2025-04) — Echo Chamber: RL Amplifies Pretraining Behaviors.
• arXiv:2510.18176 (2025-10) — Local Coherence or Global Validity in RLVR Traces.
• arXiv:2605.28388 (2026-05) — Mechanistically Interpreting Sample Difficulty in RLVR.

**Your task:**
(1) RE-TEST the "capability boundary collapse" and "selection vs. creation" claims. Have recent models, multi-agent orchestration, or hybrid training methods (e.g., iterated distillation, process reward models) since relaxed the narrowing effect? Where does the constraint still hold? Ground any update in a real paper.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months — papers claiming distillation alone is sufficient, or that RLVR alone scales with scale, or that the two-stage recipe is an artifact of older model sizes.
(3) Propose 2 research questions that assume the regime may have shifted: one assuming capability boundary collapse is now solvable; one assuming reasoning installation and tuning are no longer cleanly separable.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
