When does RL actually extend reasoning beyond pretraining?
Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
Whether RL post-training extends reasoning beyond what pretraining provided, or merely refines sampling, is one of the field's live disputes. The disagreement persists because modern pipelines are uncontrolled: pretraining corpora are opaque, mid-training is under-examined, and RL interacts with unknown priors. This paper builds a fully controlled framework (synthetic reasoning tasks with explicit atomic operations, parseable step traces, and systematic manipulation of training distributions) to isolate each stage's causal contribution. It evaluates two kinds of generalization: extrapolative (harder compositions) and contextual (new surface contexts).
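To make "explicit atomic operations" and "parseable step traces" concrete, here is a minimal toy sketch. It is not the paper's actual task suite: the operation set, the trace format, and the names `ATOMIC_OPS`, `make_task`, and `verify_trace` are all hypothetical. In this toy, raising `depth` stands in for extrapolative generalization (harder compositions), and changing `context` stands in for contextual generalization (a new surface form over the same operations).

```python
import random

# Hypothetical atomic operations; the paper's real operation set is not given in this note.
ATOMIC_OPS = {
    "add": lambda x, k: x + k,
    "sub": lambda x, k: x - k,
    "mul": lambda x, k: x * k,
}


def make_task(depth, context="x", seed=0):
    """Build one problem: a start value, `depth` atomic steps, and a step trace."""
    rng = random.Random(seed)
    value = rng.randint(1, 9)
    trace = [f"{context} = {value}"]
    for _ in range(depth):
        name = rng.choice(sorted(ATOMIC_OPS))
        k = rng.randint(1, 9)
        value = ATOMIC_OPS[name](value, k)
        trace.append(f"{name} {k} -> {context} = {value}")
    return {"depth": depth, "trace": trace, "answer": value}


def verify_trace(trace, context="x"):
    """Re-execute a trace step by step; True only if every intermediate value matches."""
    value = int(trace[0].split(" = ")[1])
    for line in trace[1:]:
        step, result = line.split(" -> ")
        name, k = step.split()
        value = ATOMIC_OPS[name](value, int(k))
        if result != f"{context} = {value}":
            return False
    return True


in_distribution = make_task(depth=3)                          # training-like difficulty
extrapolative = make_task(depth=6, seed=1)                    # deeper composition
contextual = make_task(depth=3, context="apples", seed=2)     # same ops, new surface context
```

Because every step can be re-executed, a trace can be checked for process correctness rather than only for its final answer, which is what makes this kind of task usable for controlled comparisons.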
The reconciliation is precise. RL produces true capability gains (measured at pass@128, not just pass@1) only when two conditions hold: pretraining leaves sufficient headroom, and the RL data targets the model's edge of competence (difficult but not out of reach). When pretraining has already established the reasoning primitives, RL's job is to extend their composition; when it has not, RL cannot conjure them.
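The pass@1 versus pass@128 distinction is easier to see with numbers. The note does not say how the paper computes pass@k; the sketch below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: chance that at least one of k draws
    (without replacement, from n samples with c correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any k draws include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)


# A problem solved only once in 256 samples:
rare_pass_1 = pass_at_k(256, 1, 1)      # about 0.004
rare_pass_128 = pass_at_k(256, 1, 128)  # 0.5
# A problem never solved in 256 samples:
never_pass_128 = pass_at_k(256, 0, 128)  # 0.0
```

A rarely solved problem scores near zero at pass@1 but substantially at pass@128, so RL that only raises pass@1 may be sharpening sampling of solutions the model could already reach. A rise at pass@128 on problems that previously scored zero is harder to explain that way, which is why the large-k measure is the one used to argue for a true capability gain.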
This is the controlled-experiment capstone for the vault's RLVR-capability cluster. It sharpens "Does RL teach reasoning or just when to use it?" and "Why does RLVR work with completely random rewards?" by adding the headroom and edge-of-competence conditions under which RL genuinely extends (not just samples) capability. It also gives actionable guidance for data curricula and compute allocation across stages.
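One way to read the edge-of-competence condition as a data-curation rule is to bucket candidate RL problems by the base model's empirical pass rate and train on the middle bucket. The sketch below is an assumption-laden illustration, not the paper's procedure: the thresholds `low` and `high`, the bucket names, and the function `bucket_by_competence` are hypothetical, and the note itself only says "difficult but not out of reach".

```python
def bucket_by_competence(pass_rates, low=0.05, high=0.6):
    """Split problems by empirical pass rate of the pre-RL model.

    pass_rates maps a problem id to the fraction of sampled attempts that were correct.
    low and high are hypothetical cut-offs chosen only for illustration.
    """
    buckets = {"out_of_reach": [], "edge": [], "mastered": []}
    for problem_id, rate in pass_rates.items():
        if rate < low:
            buckets["out_of_reach"].append(problem_id)  # essentially never solved
        elif rate > high:
            buckets["mastered"].append(problem_id)      # little headroom left
        else:
            buckets["edge"].append(problem_id)          # difficult but reachable
    return buckets


# Made-up pass rates for three illustrative problems.
example_rates = {"depth2_a": 0.95, "depth5_b": 0.30, "depth9_c": 0.0}
example_buckets = bucket_by_competence(example_rates)
rl_training_set = example_buckets["edge"]
```

Under this reading, the "mastered" bucket corresponds to problems where pretraining left no headroom, and the "out_of_reach" bucket to problems whose primitives were never established, so neither is expected to yield gains from RL.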
Inquiring lines that use this note as a source 5
- How does pretraining determine what RL can later teach a model?
- Can RL create new reasoning primitives that pretraining never established?
- How do extrapolative and contextual generalization measure RL reasoning gains?
- When does reinforcement learning actually produce true reasoning gains in models?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
Related concepts in this collection 3
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  this adds the controlled-experiment conditions under which RL extends vs merely activates
- Why does RLVR work with completely random rewards?
  RLVR improves reasoning performance even with incorrect or random reward signals. This challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing.
  both bound what RL actually contributes; this specifies the pretraining-headroom precondition
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  the headroom condition is the flip side: RL can only extend primitives pretraining already laid down
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Eliciting Reasoning in Language Models with Cognitive Tools
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- Teaching Large Language Models to Reason with Reinforcement Learning
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
Original note title
RL produces true reasoning gains only when pretraining leaves headroom and RL data targets the model's edge of competence