Does reinforcement learning create new reasoning abilities or activate existing ones?
RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.
Two prominent claims about what RL post-training does appear contradictory:
The timing thesis: RL functions as a deployment optimizer (see "Does RL teach reasoning or just when to use it?" and "Do base models already contain hidden reasoning ability?"). Evidence: base models outperform RLVR-trained models at high pass@k, RL-trained models show the same solution strategies as base models, and even one training example can be enough (see "Can a single training example unlock mathematical reasoning?").
The capability thesis: RL can discover reasoning strategies that base models cannot (see "Can reinforcement learning discover reasoning strategies base models cannot?"). Evidence: ProRL shows strategies absent from any base model sample regardless of budget, while self-evolving curriculum RL breaks the boundary constraints identified by pass@k analysis (see "Does RLVR actually expand what models can reason about?").
The domain-conditional resolution: Both are correct under different conditions. For standard math/code reasoning where the problem structure is well-represented in pretraining data, RL activates latent capability (timing thesis). For complex tasks requiring multi-step planning, tool coordination, or novel strategy recombination, RL may create genuinely new capability through prolonged training (capability thesis).
Supporting evidence for the conditional view:
- RLVR pass@k boundary collapse occurs on standard benchmarks (MATH, GSM8K)
- ProRL novel strategy discovery occurs on problems requiring deep planning
- SWE-RL doubles baseline performance on long-horizon engineering tasks, a gain that goes beyond activation
- Duration matters: short RLVR narrows boundaries while prolonged RL pushes through them
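The pass@k comparison behind the boundary argument can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n samples of which c are correct. The per-problem counts below are hypothetical, chosen only to illustrate how an RL-trained model can lead at pass@1 while a base model leads at large k; they are not results from any cited paper, and the helper names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL correct-sample counts per problem, out of n = 100 samples each.
N_SAMPLES = 100
base_counts = [2, 1, 3, 1]   # rarely correct, but every problem is solved sometimes
rl_counts = [60, 70, 0, 0]   # reliable on two problems, never solves the other two

for k in (1, 10, 100):
    base = mean_pass_at_k(base_counts, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_counts, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL-style counts win at k = 1 and the base-style counts win at k = 100, which is the crossover pattern the timing thesis cites. A capability-expansion result would instead show the RL-trained model solving problems whose base count stays at zero however many samples are drawn.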
The practical implication: RL training investment should be calibrated to the target domain. For standard reasoning, minimal RL (even one example) suffices. For complex agentic tasks, sustained RL investment with evolving curricula is justified.
Inquiring lines that use this note as a source 27
- What other latent LLM capabilities remain inactive without explicit activation cuing?
- How does baseline capability level affect RL improvement ceiling?
- What behavioral changes occur during reward learning training?
- What makes reasoning capability a pre-training rather than post-training phenomenon?
- What capabilities actually require massive scale versus specialized training regimes?
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- How does model capability relate to personality conditioning flexibility?
- Does specialized training in one domain create capability cliffs elsewhere?
- Does RL refine existing knowledge or discover entirely new capabilities?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- What other triggers can activate the latent reasoning capability?
- Does RLVR expand model capability or reorganize existing capability?
- Why does prolonged RL discover strategies absent from any base model sample?
- Can pretraining signals unlock latent reasoning that post-training merely activates?
- What happens to base model capabilities when you apply finetuning?
- How does post-training shift models from passive prediction to on-policy action?
- Does RL training activate latent meta-learning capacity or create it from scratch?
- Why do overtrained domains show different RL training outcomes than novel tasks?
- What training duration is actually needed for RL to expand capabilities?
- Can the exploration ceiling be raised beyond what pretraining established?
- What does RL post-training actually teach reasoning systems?
- Does RLVR teach new reasoning or activate existing pretraining capabilities?
- Why does pre-training provide the raw material for emergent thinking?
- How does pretraining determine what RL can later teach a model?
- Can RL create new reasoning primitives that pretraining never established?
- What makes content informative and not-yet-mastered for reinforcement during pretraining?
- What makes a model fail to activate relevant skills from its own harness?
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Eliciting Reasoning in Language Models with Cognitive Tools
- Teaching Large Language Models to Reason with Reinforcement Learning
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
Original note title
RL capability creation is domain-conditional — standard reasoning activates latent capability while complex planning may generate genuinely novel strategies