SYNTHESIS NOTE

Does RLVR actually expand what models can reason about?

Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.

Synthesis note · 2026-02-22 · sourced from RLVR

The strongest empirical challenge to the "RL teaches reasoning" narrative comes from pass@k analysis, which asks whether at least one of k sampled attempts solves a problem. At small k (e.g., k=1), RLVR models outperform their base models: they produce correct answers more reliably on any given attempt. But as k increases, base models consistently surpass RLVR models across the benchmarks and model families tested. The implication is that the reasoning paths RLVR models generate are already present in the base model's sampling distribution.
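To make the pass@k comparison concrete, here is a minimal sketch of the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), computed from n samples of which c are correct. The sample counts below are hypothetical and chosen only to show the crossover; they are not results from any benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given per-problem correct counts."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts out of N = 256 for two problems.
N = 256
base_counts = [20, 2]   # base model: rarely right, but right at least once on both
rlvr_counts = [200, 0]  # RLVR model: reliable on the first, never right on the second

for k in (1, 16, 256):
    base = mean_pass_at_k(base_counts, N, k)
    rlvr = mean_pass_at_k(rlvr_counts, N, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these made-up counts the RLVR model leads at k=1 (about 0.39 versus 0.04), while at k=256 the base model reaches 1.0 and the RLVR model stays at 0.5, because it never solves the second problem.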

This reframes what RLVR actually does. Rather than expanding the frontier of solvable problems, RLVR narrows the sampling distribution toward correct solutions that were already accessible. The model learns to find correct paths more efficiently, not to reason in fundamentally new ways. Manual inspection supports this: for most problems where RLVR models succeed, the base model can produce at least one correct chain-of-thought.
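The narrowing effect can be illustrated with a toy calculation. This is not a model of RLVR training: it assumes, purely for illustration, that narrowing behaves like raising each reasoning path's probability to a power beta and renormalizing, and it uses two invented problems with three paths each. All names and numbers are hypothetical.

```python
def sharpen(probs, beta):
    """Raise each path probability to the power beta and renormalize."""
    weights = [p ** beta for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def pass_at_k_exact(p_correct, k):
    """Chance that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_correct) ** k


# Toy problems: (path probabilities under the base model, index of the correct path).
PROBLEMS = {
    "easy": ([0.50, 0.30, 0.20], 0),  # the correct path is already the most likely
    "hard": ([0.60, 0.38, 0.02], 2),  # the correct path is rare but reachable
}


def mean_pass_at_k_sharpened(beta, k):
    """Average pass@k over the toy problems after sharpening by beta."""
    scores = []
    for probs, correct_idx in PROBLEMS.values():
        p_correct = sharpen(probs, beta)[correct_idx]
        scores.append(pass_at_k_exact(p_correct, k))
    return sum(scores) / len(scores)


for k in (1, 256):
    base = mean_pass_at_k_sharpened(1.0, k)  # beta = 1 leaves the distribution unchanged
    sharp = mean_pass_at_k_sharpened(4.0, k)
    print(f"k={k:>3}  base={base:.3f}  sharpened={sharp:.3f}")
```

Under these assumptions the sharpened distribution wins at k=1 and loses at k=256: the rare correct path on the hard problem is squeezed out, which is the qualitative pattern the pass@k analysis describes.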

Six popular RLVR algorithms (including GRPO and PPO variants) perform similarly, and all remain far from optimal in leveraging the base model's potential: they converge on similar subsets of the base model's capability space. This suggests the bottleneck is not algorithmic but structural: on-policy RL with verifiable rewards optimizes sampling, not capability.

The contrast with distillation is sharp. Distillation from a stronger teacher can transfer genuinely new reasoning patterns, expanding the student's reasoning scope beyond what the base model could sample. Read alongside the related note "Does RL teach reasoning or just when to use it?", the RLVR finding fits: activation is not creation. Distillation, by contrast, is creation: it writes new patterns into the model's distribution.

The practical implication: if you need capabilities the base model doesn't have, distillation from a stronger model is the path. If the base model can already solve the problem (given enough samples), RLVR makes it reliable. These are different tools for different gaps.
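The decision rule above can be sketched as a small helper. The function name, the labels, and the 0.9 reliability threshold are hypothetical choices for illustration. Note one caveat: seeing zero correct answers in n samples only suggests a capability gap, since a larger sample might still surface a correct path.

```python
def suggest_tool(correct: int, n: int, reliable_at: float = 0.9) -> str:
    """Map base-model sampling results on one problem to a suggested tool.

    correct: number of correct answers among n base-model samples.
    reliable_at: assumed per-sample accuracy above which nothing is needed.
    """
    if n <= 0 or not 0 <= correct <= n:
        raise ValueError("require n > 0 and 0 <= correct <= n")
    if correct == 0:
        # No correct sample observed: possibly a capability gap.
        return "distillation"
    if correct / n >= reliable_at:
        # Already reliable on a single attempt.
        return "none"
    # Solvable with enough samples but unreliable: a sampling-efficiency gap.
    return "rlvr"


# Hypothetical base-model results out of 256 samples per problem.
for name, correct in {"p1": 0, "p2": 3, "p3": 250}.items():
    print(name, suggest_tool(correct, 256))
```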

Inquiring lines that read this note 111

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do we evaluate AI systems when user perception misleads actual performance?

Does good simulation eventually count as genuine realization?

Can model confidence signals reliably improve reasoning quality and calibration?

What constrains reinforcement learning's ability to expand model reasoning?

What properties determine whether reward signals teach genuine reasoning?

How does memorization interact with learning and generalization?

How much RLVR improvement comes from benchmark data memorization?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can clean benchmarks reveal true RLVR reasoning gains?

What structural advantages do diffusion language models offer over autoregressive methods?

Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?

Why do reward structures fail to shape long-term agent learning?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Does reinforcement learning teach reasoning or just when to reason?

How do adversarial and manipulative prompts attack reasoning models?

Why does verification consistently lag behind AI generation?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How do self-generated feedback mechanisms enable effective model learning?

Can language model RL training avoid reward hacking and misalignment?

Can alternative training methods improve on supervised fine-tuning for language models?

How do Q-value models improve action selection compared to value models?

Can ensemble evaluation methods reduce bias more than single judges?

Can judges trained on both verifiable and non-verifiable tasks transfer across domains?

How can models identify insufficient information and respond appropriately without guessing?

What makes abstention a learnable behavior instead of a default penalty?

Can next-token prediction alone produce genuine language understanding?

Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?

How can AI agents autonomously learn and transfer skills across tasks?

Can process supervision improve agentic RL through meta-reasoning rewards?

How can process reward models supervise complex reasoning traces?

How much data do generative process reward models actually need?

Can prompting inject entirely new knowledge into language models?

Why does prompting discover capabilities that need reward-driven refinement?

Do base models contain latent reasoning that training can unlock?

What pretraining formats encode latent reasoning strategies that RLVR can surface?

How can AI systems learn from failures without cascading errors?

Can held-out validation gates prevent optimizer hallucinations in skill proposals?

How do aggregate reward models systematically exclude minority user preferences?

What makes reward models fundamentally different from policy discriminators?

What are the consequences of models training on synthetic data?

How does off-policy data reuse inside trust regions affect convergence guarantees?

How do policy learning algorithm choices affect multi-objective optimization stability?

Can on-policy optimization variants avoid the probability squeezing problem?

Related concepts in this collection 4

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
RLVR confirms the timing-not-capability thesis with pass@k evidence

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
RLVR finding shows the latent capability is an upper bound, not a floor

Can reinforcement learning discover reasoning strategies base models cannot?
Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
tension: this claims RL does expand boundaries under prolonged training

Can simple rewards alone teach complex domain reasoning?
Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
emergence may operate at a different level than sampling efficiency
