Does reinforcement learning create new reasoning abilities or activate existing ones?

RL post-training might either unlock latent capabilities in base models or genuinely create novel strategies. Understanding which happens under what conditions clarifies how to invest in model training effectively.

Synthesis note · 2026-02-23 · sourced from Reasoning Architectures

Two prominent claims about what RL post-training does appear contradictory:

The timing thesis: RL functions as a deployment optimizer (see "Does RL teach reasoning or just when to use it?" and "Do base models already contain hidden reasoning ability?"). Evidence: base models outperform RLVR-trained models at high pass@k, RL-trained models show the same solution strategies as base models, and a single training example can be enough to unlock mathematical reasoning (see "Can a single training example unlock mathematical reasoning?").

The capability thesis: RL can discover reasoning strategies that base models cannot (see "Can reinforcement learning discover reasoning strategies base models cannot?"). Evidence: ProRL shows strategies absent from any base model sample regardless of sampling budget, while self-evolving curriculum RL breaks the boundary constraints identified by pass@k analysis (see "Does RLVR actually expand what models can reason about?").

The domain-conditional resolution: Both are correct under different conditions. For standard math/code reasoning where the problem structure is well-represented in pretraining data, RL activates latent capability (timing thesis). For complex tasks requiring multi-step planning, tool coordination, or novel strategy recombination, RL may create genuinely new capability through prolonged training (capability thesis).

Supporting evidence for the conditional view:

RLVR pass@k boundary collapse occurs on standard benchmarks (MATH, GSM8K).
ProRL novel strategy discovery occurs on problems requiring deep planning.
SWE-RL doubles the baseline on long-horizon engineering tasks, a gain that goes beyond activation.
Duration matters: short RLVR narrows boundaries, while prolonged RL pushes through them.
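
The pass@k comparisons in this note can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The function names and the per-problem correct counts are hypothetical, invented only to show the shape of a crossover; they are not results from any of the cited work.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that were correct, k: evaluation budget.
    Returns the probability that at least one of k samples drawn without
    replacement from the n generated samples is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return mean(pass_at_k(n, c, k) for c in correct_counts)


# HYPOTHETICAL counts of correct samples out of n = 256, four problems each.
# The "base" model solves every problem at least occasionally; the "rl"
# model solves two problems reliably and never solves the other two.
N_SAMPLES = 256
base_correct = [64, 8, 2, 1]
rl_correct = [240, 128, 0, 0]

for k in (1, 16, 256):
    base_score = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rl_score = mean_pass_at_k(rl_correct, N_SAMPLES, k)
    print(f"k={k:>3}  base={base_score:.3f}  rl={rl_score:.3f}")
```

Under these made-up counts the RL model leads at k=1 and the base model leads at k=256, which is the pattern the timing thesis relies on. The capability thesis would correspond to problems where the base count stays at zero at any budget while the RL count does not.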

The practical implication: RL training investment should be calibrated to the target domain. For standard reasoning, minimal RL (even one example) suffices. For complex agentic tasks, sustained RL investment with evolving curricula is justified.

Inquiring lines that read this note 29

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Do base models contain latent reasoning that training can unlock?

What pretraining choices and baseline capability constrain reinforcement learning gains?

What constrains reinforcement learning's ability to expand model reasoning?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

What prevents language models from reliably adopting diverse personas?

How does model capability relate to personality conditioning flexibility?

Does domain specialization cause models to lose capabilities elsewhere?

Does specialized training in one domain create capability cliffs elsewhere?

Does reinforcement learning teach reasoning or just when to reason?

How do self-generated feedback mechanisms enable effective model learning?

Does reinforcement learning create new reasoning abilities or activate existing ones?
