INQUIRING LINE

Does RLVR expand model capability or reorganize existing capability?

This explores whether reinforcement learning with verifiable rewards (RLVR) teaches models genuinely new reasoning they couldn't do before, or just sharpens access to abilities already latent in the base model.


The corpus leans hard toward reorganization, with a few important cracks in that consensus. The cleanest version of the reorganization view comes from pass@k analysis: base models outperform their RLVR-trained versions at high k, that is, when allowed to sample many times. RLVR therefore isn't unlocking new solvable problems; it narrows the model's output distribution toward answers the base model could already reach (see "Does RLVR actually expand what models can reason about?"). A complementary framing reads RLVR as activation rather than instruction: a single training example can trigger the gains, and for models with the appropriate pretraining even spurious or random rewards work nearly as well as correct ones, which only makes sense if the reasoning was already learned in pretraining and RL is just switching it on (see "What does reward learning actually do to model reasoning?").
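To make the pass@k argument concrete, here is a minimal sketch in Python. It assumes the commonly used unbiased pass@k estimator (the notes do not say which estimator the underlying papers used), and the per-problem counts are invented purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n, k):
    """Average pass@k over a list of per-problem correct-sample counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical data: correct samples out of n=100 on four problems.
base_counts = [30, 20, 2, 1]  # base model: rarely right, but reaches every problem
rlvr_counts = [90, 80, 0, 0]  # RLVR model: reliable on two problems, never solves two

for k in (1, 10, 100):
    base = benchmark_pass_at_k(base_counts, 100, k)
    rlvr = benchmark_pass_at_k(rlvr_counts, 100, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the RLVR model leads at k=1 (0.425 against 0.1325), but the base model reaches 1.0 at k=100 while the RLVR model stalls at 0.5. That crossover is the pattern the reorganization view points to: better sampling efficiency, narrower coverage.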

If RLVR isn't teaching reasoning, what is it doing? Several notes converge on "deployment, not creation." One frames it as teaching the model *when* to reason rather than *how*: hybrid models recover 91% of the gains by routing tokens alone, and the activation vectors for reasoning strategies already exist before any RL touches the model (see "Does RL post-training create reasoning or just deploy it?"). This reorganization has a cost, though. RLVR's on-policy nature pushes exploitation over exploration, collapsing the model's problem-solving boundary so it gets better at a narrower band of problems (see "Why does RLVR training narrow a model's problem solving ability?"). It also amplifies a single dominant output format from pretraining while suppressing the alternatives, often within the first epoch: a kind of diversity pruning that looks like improvement but is really format convergence (see "Does RL training collapse format diversity in pretrained models?").

There's also a sharp warning about *negative* reorganization. Training on problems that are too hard for the model doesn't just fail to help. Group-relative normalization treats rare accidental correct answers as high-advantage trajectories, reinforcing shortcuts like answer repetition and skipped computation, which then bleed into and corrupt capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). So the reorganization can be subtractive, not just redistributive.
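The effect of group-relative normalization on rare successes can be sketched in a few lines of Python. The notes name the mechanism but not its formula, so this assumes a GRPO-style normalization (reward minus group mean, divided by group standard deviation). The group size, the `eps` constant, and the reward values are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical groups of 16 rollouts scored by a binary verifier.
moderate = [1.0] * 8 + [0.0] * 8   # half the rollouts are correct
too_hard = [1.0] * 1 + [0.0] * 15  # a single, possibly accidental, success

adv_moderate = group_relative_advantages(moderate)
adv_hard = group_relative_advantages(too_hard)

print("advantage of a correct rollout, moderate problem:", round(max(adv_moderate), 2))
print("advantage of a correct rollout, too-hard problem:", round(max(adv_hard), 2))
```

Under these assumptions the lone success on the too-hard problem receives an advantage close to 3.9, against roughly 1.0 for a correct rollout on the moderate problem. The normalization cannot tell whether that single success came from sound reasoning or from a shortcut, so whatever produced it is strongly reinforced.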

The genuinely interesting tension is that the picture is domain-conditional, not universal. For standard reasoning, RL activates latent ability; for complex multi-step planning, it generates strategies the base model cannot reach even with massive sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). Prolonged RL, with KL control, policy resetting, and crucially *non-mathematical* tasks where base models have no established patterns to fall back on, beats base models at *every* pass@k level, the signature of true boundary expansion rather than sampling optimization (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The reconciliation: RLVR reorganizes where the base model already has the patterns, and only expands where it doesn't.
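The two signatures described in these notes, a crossover at high k versus a lead at every k, can be written down as a small decision rule. This Python sketch is an illustration only: the function name, the labels, and the curve values are hypothetical, and a real comparison would also need confidence intervals on each pass@k estimate.

```python
def compare_pass_curves(base: dict, rl: dict) -> str:
    """Label a pair of pass@k curves, given as {k: pass@k} dicts."""
    ks = sorted(set(base) & set(rl))
    if not ks:
        raise ValueError("the two curves share no k values")
    if all(rl[k] > base[k] for k in ks):
        # RL leads at every measured k: consistent with boundary expansion.
        return "consistent with expansion"
    if rl[ks[0]] > base[ks[0]] and base[ks[-1]] > rl[ks[-1]]:
        # RL leads at the smallest k, base leads at the largest: sampling sharpening.
        return "consistent with reorganization"
    return "inconclusive"

# Hypothetical curves for two tasks.
math_base = {1: 0.13, 16: 0.62, 256: 0.91}
math_rl = {1: 0.43, 16: 0.70, 256: 0.84}
planning_base = {1: 0.02, 16: 0.05, 256: 0.08}
planning_rl = {1: 0.35, 16: 0.58, 256: 0.71}

print(compare_pass_curves(math_base, math_rl))
print(compare_pass_curves(planning_base, planning_rl))
```

The rule only speaks to the range of k actually measured. A lead at every tested k does not rule out a crossover at some larger k, which is why the verdicts are phrased as "consistent with" rather than as proof.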

One last thing worth knowing if you go deeper: much of the "RLVR works!" evidence may be measuring the wrong thing entirely. Benchmark gains on contaminated datasets are largely memorization (Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on the post-release LiveMathBench), and on clean benchmarks only *correct* rewards help (see "Does RLVR success on math benchmarks reflect genuine reasoning improvement?"). Behavioral activation and benchmark improvement are separable phenomena that can coexist without either proving real capability gain (see "Can genuine reasoning activation coexist with contaminated benchmarks?"). And even when reasoning *traces* get more coherent, locally consistent steps can still add up to a globally invalid proof; the improvement is structural, not semantic (see "Does RLVR actually improve mathematical reasoning or just coherence?"). So before asking whether RLVR expands or reorganizes capability, it's worth asking whether a given measured gain reflects capability at all.


Sources (11 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RLVR success on math benchmarks reflect genuine reasoning improvement?

Qwen2.5-Math-7B reconstructs 54.6% of MATH-500 from partial prompts but scores 0.0% on post-release LiveMathBench, revealing dataset contamination. On clean benchmarks, only correct rewards improve performance; random and inverse rewards fail or degrade reasoning ability.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR activates genuine reasoning patterns through RL training while benchmark improvements may reflect data memorization on contaminated datasets. These operate at different measurement levels and can coexist without contradiction.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about whether RLVR expands model capability or merely reorganizes latent ability. The question remains open; the library's findings (April 2025–May 2026) are dated.

What a curated library found — and when (dated claims, not current truth):

• Pass@k analysis shows base models outperform RLVR-trained versions at high k (many samples per problem), suggesting RL narrows rather than expands (2025-04).
• Single training examples and even spurious rewards trigger RLVR gains nearly as well as correct ones — consistent with activation, not instruction (2025-04).
• Hybrid routing models recover ~91% of reasoning gains by token allocation alone; activation vectors for reasoning strategies pre-exist RL (2025-05).
• RLVR collapses problem-solving boundaries via on-policy exploitation and amplifies one pretraining format while suppressing alternatives (2025-04, 2025-07).
• Overly-hard samples induce shortcut behaviors (answer-repetition, skipped computation) that corrupt existing capability — reorganization can be subtractive (2025-07).
• Prolonged RL with KL control and policy resetting beats base models at *every* pass@k level on non-mathematical tasks, a signature of true expansion rather than sampling optimization (2025-05).
• Benchmark gains on contaminated datasets are primarily memorization; on clean benchmarks only correct rewards help (2025-07).
• Trace coherence improvements don't guarantee trace validity — local consistency ≠ global semantic correctness (2025-10).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (April 2025) — pass@k baseline comparison
• arXiv:2505.24864 (May 2025) — prolonged RL domain-conditional expansion
• arXiv:2507.10532 (July 2025) — data contamination and memorization
• arXiv:2510.18176 (October 2025) — trace validity vs. coherence

Your task:

(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 3.7 equivalents), training methods (supervised chain-of-thought, multi-task RL, mid-training), tooling (inference-time scaling, tree search, verifiers), or evaluation harnesses (clean benchmarks, mechanistic analysis) have since RELAXED or OVERTURNED it. Separate the durable question (likely still open: does RL teach or activate?) from perishable limitations (possibly resolved: does RL collapse boundaries?). Cite what resolved each, and state plainly what still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months (after May 2026) that either claims RL *does* expand capability in previously-inaccessible domains or offers mechanistic evidence that reorganization alone cannot explain observed gains.

(3) Propose 2 research questions that ASSUME the regime may have shifted: (a) one testing whether inference-time scaling (e.g., best-of-N, beam search) collapses the expand/reorganize distinction for post-2026 models, and (b) one probing whether mechanistic interpretability can distinguish genuine capability expansion from high-fidelity reorganization in the activation layer.

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines