INQUIRING LINE

How does Supervised RL bridge the gap between SFT and RLVR?

This explores how a middle training step — Supervised RL (SRL), an imitation phase that learns from worked examples but is shaped by reward — fixes the specific failure each of the two standard methods has when used alone: SFT copies surface form without reasoning, and RLVR can't get traction when the model never stumbles onto a correct answer to reward.


Supervised RL sits between plain supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR), and the corpus frames the bridge as solving a chicken-and-egg problem that each pure method runs into. Start with what each end gets wrong on its own. SFT teaches a model what good answers look like, but the lesson stops at the surface: on optimization problems, fine-tuned models produce clean JSON, valid identifiers, and the right section headings while still violating the actual constraints. They learn the costume of a solution, not the reasoning to build one (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). RLVR has the opposite shape of failure. It rewards only verifiably correct outcomes, so it works well when the model already lands on correct answers sometimes, and goes silent when it never does. Even when it works, it mostly sharpens sampling toward solutions already in the base model's repertoire rather than teaching genuinely new reasoning (see "Does RLVR actually expand what models can reason about?" and "What does reward learning actually do to model reasoning?").
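The sharpening claim rests on pass@k comparisons, so a small sketch of how pass@k is usually computed may help. The function below is the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The two "models" and their per-problem success counts are invented purely for illustration and do not come from any cited note.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical numbers: correct samples out of n=100 on two problems.
N = 100
base_counts = [5, 5]    # base model: rarely right, but on both problems
tuned_counts = [80, 0]  # tuned model: sharp on one problem, never on the other

for k in (1, 64):
    print(k, mean_pass_at_k(base_counts, N, k), mean_pass_at_k(tuned_counts, N, k))
```

With these made-up counts the tuned model is ahead at k=1 and behind at k=64, which is the shape of result the notes describe: higher sampling efficiency on problems already within reach, with no gain in the set of problems that can be solved at all.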

That gap is exactly where SRL lives. The curriculum result is the clearest statement: running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, beats either method used alone, because the imitation phase 'makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen' (see "Does sequencing imitation then exploration training improve reasoning?"). In other words, SRL manufactures the precondition RLVR silently assumes. RLVR needs the model to occasionally succeed so that there is something to reward; SRL gets it producing plausible reasoning trajectories first, so the reward signal stops being all-zeros.
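To see why an all-zeros reward is uninformative, here is a minimal sketch of the group-relative normalization mentioned in the source notes: each rollout's advantage is its reward minus the group mean, divided by the group standard deviation. The group size of 16, the `eps` constant, and the function name are assumptions chosen for the example, not details taken from a specific paper's code.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; exactly 0.0 when all rewards match
    return [(r - mean) / (std + eps) for r in rewards]


# Group where the model never succeeds: binary rewards are all zero.
never_right = [0] * 16
# Group with a single success among 16 rollouts.
one_success = [1] + [0] * 15
# Group where about half the rollouts succeed.
half_right = [1] * 8 + [0] * 8

print(group_relative_advantages(never_right))
print(group_relative_advantages(one_success)[:2])
print(group_relative_advantages(half_right)[:1])
```

When every rollout fails, every advantage is zero and the policy gradient carries no information. A lone success in a group of 16 receives an advantage of about 3.87, while a group that is right half the time gives each success an advantage of about 1. Under these assumptions, a prior imitation phase matters because it moves groups out of the first case.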

The corpus also explains why you can't just do more of either. Pure RLVR tends to narrow rather than broaden: its on-policy nature pushes exploitation over exploration, collapsing the model's problem-solving scope ('capability boundary collapse'), and feeding it problems that are too hard makes this worse, since rare accidental wins get treated as high-value and the model learns shortcuts and answer-repetition instead of reasoning (see "Why does RLVR training narrow a model's problem solving ability?" and "Do overly hard RLVR samples actually harm model capabilities?"). And naively bolting SFT in front of RL isn't free either: when the expert data diverges from the model's own distribution, training goes through a destabilizing shift–readapt–overfit progression, which is why approaches like CHORD fold the supervised signal in as a dynamically-weighted auxiliary objective inside on-policy RL rather than as a separate front-loaded stage (see "Why does SFT-then-RL training follow a predictable three-phase pattern?").
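The idea of a dynamically-weighted auxiliary objective can be sketched in a few lines. What follows is an illustration of the general pattern only, not the published CHORD formulation: the linear decay schedule, the default weights, and both function names are hypothetical, and the loss values are placeholders for whatever the RL and SFT objectives return in a real training loop.

```python
def sft_weight(step: int, total_steps: int,
               mu_start: float = 0.9, mu_end: float = 0.05) -> float:
    """Hypothetical schedule: linearly decay the SFT weight over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return mu_start + (mu_end - mu_start) * frac


def mixed_loss(rl_loss: float, sft_loss: float, mu: float) -> float:
    """Blend the on-policy RL loss with an auxiliary SFT loss on expert data."""
    return (1.0 - mu) * rl_loss + mu * sft_loss


# Placeholder loss values, just to show how the blend shifts over training.
for step in (0, 50, 100):
    mu = sft_weight(step, total_steps=100)
    print(step, round(mu, 3), round(mixed_loss(rl_loss=2.0, sft_loss=4.0, mu=mu), 3))
```

The design point is that the expert signal is never applied as a separate stage that the RL phase must later recover from. It starts dominant and fades as the policy's own rollouts become informative.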

So the bridge isn't a compromise between two settings on a dial; it's a sequencing insight. SFT alone gives form without feasibility; RLVR alone needs feasibility before it can give anything. The supervised-reward middle does the unglamorous work of getting the model into the region where verifiable rewards become a usable teaching signal. Worth noting for the curious: research suggests RL changes surprisingly little of the network. Only 5–30% of parameters update, in sparse but nearly full-rank subnetworks that are consistent across seeds (see "Does reinforcement learning update only a small fraction of parameters?"), which fits the picture of the RLVR phase as a precise sharpening operation on foundations laid earlier, not a wholesale rewrite.


Sources (8 notes)

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Does reinforcement learning update only a small fraction of parameters?

Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether Supervised RL still bridges SFT and RLVR, or whether newer models, methods, and tooling have shifted the regimes.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable:
- SFT produces formatted correctness (valid JSON, syntax) but fails on constraint satisfaction; RLVR requires pre-existing correct rollouts to reward, else it collapses to shortcuts and capability-boundary narrowing (~2025).
- SRL curriculum (imitation → RL refinement) outperforms either method alone by manufacturing the precondition RLVR assumes: plausible reasoning trajectories (~2025).
- On-policy RLVR without prior supervised foundation exhibits narrow exploitation, degeneracy on hard problems, and answer-repetition instead of reasoning (~2025).
- RL updates only 5–30% of parameters in sparse, nearly full-rank subnetworks, consistent across seeds; read as evidence of sharpening, not rewrite (~2025).
- SFT then RL exhibits shift–readapt–overfit progression when expert data diverges from model distribution; methods like CHORD fold supervised signal as dynamic auxiliary loss inside on-policy RL (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2402.05808 (Feb 2024): Reverse Curriculum RL
- arXiv:2510.25992 (Oct 2025): Supervised Reinforcement Learning from Expert Trajectories
- arXiv:2508.00222 (Jul 2025): RL-PLUS countering capability boundary collapse
- arXiv:2508.11408 (Aug 2025): On-Policy RL Meets Off-Policy Experts (CHORD-like approaches)

Your task:
(1) RE-TEST EACH CONSTRAINT. For SRL-as-bridge: Does post-October 2025 work (larger models, longer RL horizons, improved verifiable reward design, or hybrid on/off-policy methods) dissolve the need for explicit SRL curricula? Does newer data or prompt engineering make SFT alone produce reasoning, not just format? Can modern RLVR succeed without prior imitation? Flag where each constraint still holds vs. where a shift in tooling or training paradigm has relaxed it.
(2) SURFACE THE STRONGEST DISAGREEMENT in the last 6 months. The library frames SRL as essential sequencing; look for papers claiming SRL overhead is unnecessary, or that direct on-policy RL + good verifiable rewards + diverse sampling suffices. Highlight any tension between arXiv:2507.14843 ("Invisible Leash: why RLVR may not escape its origin") and papers proposing RL-alone solutions.
(3) Propose 2 research questions ASSUMING the regime has moved: (a) If verifiable reward quality has improved (tighter specs, fewer false negatives), how much of the SRL bridge becomes redundant? (b) Do multi-agent or hierarchical RL setups with scaffolded expert feedback outperform sequential SRL+RLVR, and if so, why?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
