INQUIRING LINE

Can combining SRL with RLVR outperform either method used alone?

This explores whether a two-stage recipe — imitation training (SRL) first to build reasoning foundations, then verifiable-reward training (RLVR) to sharpen — beats running either stage on its own.


The corpus has a direct answer, plus a deeper reason it works. The headline result is that running Supervised RL (SRL) first to establish reasoning patterns, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation (see: Does sequencing imitation then exploration training improve reasoning?). But the *why* is the interesting part: the imitation phase isn't just a warm-up. It makes the outcome rewards informative by producing reasonable rollouts that the RL phase can then sharpen. Without that scaffolding, RLVR has nothing useful to grab onto.

That dependency makes sense once you look at what RLVR actually does. A recurring finding across the corpus is that RLVR doesn't teach new reasoning; it activates reasoning that's already latent in the model. One note reports that it works nearly as well with random rewards as with correct ones, because it triggers a phase transition in the output distribution rather than instilling skills, and that its effectiveness tracks pretraining quality, not reward quality (see: Why does RLVR work with completely random rewards?). The same theme shows up in the finding that RL functions as *selection, not discovery*: the pretrained prior bounds what exploration can reach, which is why the choice of RL algorithm barely matters (see: Does the choice of RL algorithm actually matter for reasoning?). If RLVR can only select and amplify what's already there, then an SRL imitation phase that plants better candidate behaviors is exactly the lever that raises the ceiling RLVR is working under.
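To make "selection, not discovery" concrete, here is a toy sketch. It assumes RL can be approximated as exponential reweighting of the prior by reward, a common idealization of KL-regularized policy optimization and not something taken from the cited notes. The function `reweight`, the three candidate behaviors, and all the probabilities are hypothetical.

```python
import math

def reweight(prior, rewards, beta=5.0):
    """Toy selection step: new_p(y) is proportional to prior(y) * exp(beta * reward(y)).

    An idealization, not any specific RLVR implementation.
    """
    weights = {y: p * math.exp(beta * rewards[y]) for y, p in prior.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

# Hypothetical behaviors; only the sound proof earns the verifiable reward.
rewards = {"guess": 0.0, "shortcut": 0.0, "sound_proof": 1.0}

# Prior with no mass on the rewarded behavior: reweighting cannot create it.
weak_prior = {"guess": 0.9, "shortcut": 0.1, "sound_proof": 0.0}

# Prior after an imitation phase has planted some mass on the rewarded behavior.
seeded_prior = {"guess": 0.8, "shortcut": 0.1, "sound_proof": 0.1}

print(reweight(weak_prior, rewards)["sound_proof"])    # stays at 0.0
print(reweight(seeded_prior, rewards)["sound_proof"])  # amplified well above 0.1
```

In this toy model, a behavior with zero prior mass keeps zero mass however strong the reward, while a small seeded mass is amplified sharply. That is the sense in which an imitation phase can raise the ceiling a selection process works under.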

The corpus also explains what goes wrong when you skip the foundation. RLVR fed problems that are too hard for the current model induces degenerate shortcuts (answer repetition, computation-skipping), because group-relative normalization treats rare accidental successes as high-advantage trajectories and reinforces them (see: Do overly hard RLVR samples actually harm model capabilities?). An SRL phase that first makes hard problems solvable converts those uninformative, all-or-nothing reward signals into a usable gradient. This is the same logic dressed differently: the imitation stage manufactures the 'reasonable rollouts' that keep the reward signal meaningful instead of degenerate.
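The effect of group-relative normalization on hard problems can be shown with a small sketch. It is an illustration under assumptions: binary 0/1 rewards, a group of eight rollouts per prompt, population standard deviation, and a hypothetical helper named `group_relative_advantages`. It is not the implementation used in any cited work.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by its group's mean and standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]

# Problem far too hard: every rollout fails, so every advantage is zero (no signal).
all_fail = group_relative_advantages([0, 0, 0, 0, 0, 0, 0, 0])

# One accidental success in eight: that single rollout gets a large advantage
# (about +2.65), whatever reasoning produced it; each failure gets about -0.38.
one_lucky = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# Problem made solvable (for example, after an imitation phase): half succeed,
# and the advantages are balanced at about +1 and -1.
half_solved = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

print(all_fail[0], one_lucky[0], one_lucky[1], half_solved[0])
```

With binary rewards and success rate p, the advantage of a success works out to roughly sqrt((1 - p) / p), so the rarer the success, the harder it is reinforced. A single lucky trajectory reached by answer repetition or skipped computation would be pushed up strongly, which matches the degenerate-shortcut behavior described above.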

Two cautions are worth carrying into any claim of 'outperforms.' First, RLVR's gains are partly structural rather than semantic: it measurably improves coherence between adjacent reasoning steps without guaranteeing the whole proof is valid (see: Does RLVR actually improve mathematical reasoning or just coherence?). Second, benchmark improvements can be memorization on contaminated datasets rather than genuine reasoning, and behavioral activation and benchmark scores are separable phenomena that can move independently (see: Does RLVR success on math benchmarks reflect genuine reasoning improvement?; Can genuine reasoning activation coexist with contaminated benchmarks?). So 'the curriculum wins' is most trustworthy when measured on clean, post-release benchmarks and on the activation of reasoning behavior, not just a leaderboard number.

The thing you might not have known you wanted to know: combining the two methods isn't additive, it's enabling. SRL doesn't add a separate increment of skill on top of RLVR — it changes what RLVR is *able* to do by giving a selection-and-amplification process something worth selecting. That reframes the whole 'better training recipe' question into a sequencing question about when imitation makes reward signals legible.


Sources (7 notes)

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Why does RLVR work with completely random rewards?

RLVR works nearly as well with spurious rewards as correct ones because it catalyzes a phase transition in model output distribution. The effectiveness depends on pretraining quality, not reward signal quality or training volume.

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and RC-RL perform comparably on reasoning because exploration is constrained by the pretrained distribution, not the optimizer. RL functions as selection, not discovery—the prior contains most solutions the algorithm will find.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether SRL→RLVR curriculum learning remains the strongest composition strategy for LLM reasoning. The question: does combining Supervised RL with RLVR outperform either alone, and if so, how durable is that advantage under current conditions?

What a curated library found—and when (dated claims, not current truth):
Findings span 2024–2026; treat all as perishable:
• SRL-then-RLVR curriculum substantially outperforms both methods in isolation; SRL primes rollouts that make RLVR's reward signal legible rather than degenerate (~2024–25).
• RLVR functions as selection and amplification of latent reasoning, not discovery; it works nearly equally well with random or correct rewards because it triggers a phase transition in the output distribution (~2025).
• RLVR without prior imitation induces shortcuts (answer repetition, computation-skipping) on overhard problems; SRL converts all-or-nothing rewards into usable gradients (~2024–25).
• RLVR improves local coherence between reasoning steps without guaranteeing global trace validity; gains may be structural rather than semantic (~2025–26).
• Benchmark improvements on contaminated datasets are primarily memorization; behavioral activation and leaderboard scores are separable (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.05808 (2024-02) – Reverse Curriculum RL foundation
• arXiv:2504.07912 (2025-04) – Echo Chamber: pretraining-driven behavior amplification
• arXiv:2507.10532 (2025-07) – Data contamination and memorization confounds
• arXiv:2510.18176 (2025-10) – Local coherence vs. global validity in RLVR traces

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether newer training regimes (sparse reward, negative RL, asynchronous scaling), model scale, architectural innovations (diffusion-based reasoning, multi-agent orchestration), or post-hoc validation tooling have relaxed or overturned each claim. Separate the durable question—does imitation-then-RL composition remain beneficial?—from perishable limitations (e.g., memorization on contaminated benchmarks). Cite what resolved each constraint; state plainly where it still appears to hold.
(2) Surface the strongest contradicting or superseding work from the last ~6 months. Has any paper shown curriculum learning unnecessary, or a single-stage method that matches SRL→RLVR gains?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does the curriculum advantage persist under negative RL? Does asynchronous large-scale RLVR (AReaL-class systems) still require imitation priming?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
