What distinguishes RL that creates new capabilities from RL that merely teaches timing?
This question lands on a live fault line in the corpus: where RL stops teaching an already-capable model when to fire its reasoning and starts genuinely expanding what the model can do. One camp says RL only ever reschedules existing skills; the other says it can mint new ones. The notes disagree productively about where that line sits, and the most interesting ones explain *under what conditions* each view holds.
The "merely teaches timing" view has strong support. Several notes argue that pre-training already contains the reasoning capability in latent form and that RL just optimizes *when* to deploy it: a hybrid model recovered 91% of the performance gains using only 12% of the tokens by routing alone, and the activation vectors for reasoning strategies existed *before* any RL touched the model ("Does RL teach reasoning or just when to use it?"; "Does RL post-training create reasoning or just deploy it?"). Pushed further, pass@k analysis shows base models actually *outperform* RLVR models at high k, meaning RLVR narrows the distribution toward solutions the base model already had rather than unlocking new ones ("Does RLVR actually expand what models can reason about?"). The mechanism underneath is telling: RL updates are structurally sparse, touching only 5–30% of parameters, and work mostly by *suppressing wrong trajectories* rather than building new ones ("What actually changes inside a model during RL training?"; "How does RL training reshape reasoning and what gets lost?").
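The pass@k comparison above can be made concrete with a minimal sketch. It uses the standard unbiased pass@k estimator (the chance that at least one of k samples, drawn from n generated samples of which c are correct, is correct). The per-problem counts are hypothetical numbers chosen only to illustrate the crossover; they are not results from any of the notes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n samples containing c correct ones, is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# HYPOTHETICAL data: correct samples out of n=100 for four problems.
N = 100
base_counts = [3, 2, 1, 4]    # base model: rarely right, but nonzero everywhere
rlvr_counts = [60, 70, 0, 0]  # RLVR model: reliable on two problems, never on two

for k in (1, 8, 64):
    base = mean_pass_at_k(base_counts, N, k)
    rlvr = mean_pass_at_k(rlvr_counts, N, k)
    print(f"k={k:>2}  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these illustrative counts the RLVR model wins at k=1, while the base model wins at k=64 because it still has a nonzero hit rate on the problems the RLVR model never solves. That is the shape of the "narrowing" argument, not evidence for it.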
But the corpus names three concrete conditions that flip RL from optimizer to creator. The first is **task complexity**: for standard reasoning, RL activates latent ability, but for complex multi-step planning it generates genuinely novel strategies the base model can't reach even with heavy sampling ("Does reinforcement learning create new reasoning abilities or activate existing ones?"). The second is **training regime and domain diversity**: prolonged RL with KL control, policy resetting, and *non-mathematical* tasks beats the base model across all pass@k levels, so the capability frontier genuinely moves, especially in domains where the base model has no established pattern to fall back on ("Can reinforcement learning discover reasoning strategies base models cannot?"). The third is **reward verifiability**: binary verifiable rewards drove gains from 0.15% to 73.98%, while fuzzy judgment-based rewards barely moved the needle. Clear signals unlock suppressed capability; blurry ones can't ("Why does RL succeed more on some tasks than others?").
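To show what "binary verifiable reward" and "KL control" can look like in practice, here is a minimal sketch. The function names, the exact-match normalization, and the `beta` value are all assumptions made for illustration; the notes do not specify how any particular system implements either piece.

```python
def binary_reward(model_answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the normalized answers match exactly, else 0.0.
    The normalization (trim, lowercase, collapse whitespace) is an assumption."""
    def normalize(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return 1.0 if normalize(model_answer) == normalize(reference) else 0.0

def kl_shaped_reward(task_reward: float, logp_policy: float,
                     logp_ref: float, beta: float = 0.05) -> float:
    """Subtract a per-sample KL estimate (log-prob gap between the policy and a
    frozen reference model) scaled by beta. beta=0.05 is a hypothetical value."""
    return task_reward - beta * (logp_policy - logp_ref)

# A correct answer whose sequence log-prob has drifted above the reference's.
r = binary_reward("  42 ", "42")
shaped = kl_shaped_reward(r, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
print(r, shaped)
```

The binary check leaves no room for partial credit, which is what makes the signal clear. The KL term is one common way to keep a long training run from drifting far from the starting model.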
The sharpest synthesis is that "timing" and "capability" aren't rivals but *phases*. RL training reliably runs in two stages: first, execution correctness drives learning (procedural consolidation); then strategic planning becomes the bottleneck and planning-token entropy rises (strategic exploration) ("Does RL training follow a predictable two-phase learning sequence?"). So timing optimization is the early phase, and novel strategy is what *can* emerge in the second phase, if the task is hard enough and the training long enough to get there. The catch is that scheduling matters too: structured tasks collapse output entropy while creative tasks need it, so training order itself shapes whether open-ended capability survives ("Does training order reshape how models handle different task types?").
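The entropy signal in the two-phase picture can be sketched as a small calculation. The snippet computes Shannon entropy over next-token distributions and averages it separately for "planning" and "execution" positions. The distributions and the split into the two token types are hypothetical; how the notes' underlying work labels planning tokens is not stated here.

```python
from math import log

def entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * log(p) for p in probs if p > 0)

def mean_entropy(distributions):
    """Average entropy over a list of per-position distributions."""
    return sum(entropy(d) for d in distributions) / len(distributions)

# HYPOTHETICAL next-token distributions over a 4-token vocabulary.
planning_positions = [
    [0.25, 0.25, 0.25, 0.25],  # many plausible strategies: high entropy
    [0.40, 0.30, 0.20, 0.10],
]
execution_positions = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic arithmetic step
    [1.00, 0.00, 0.00, 0.00],
]

print("planning :", round(mean_entropy(planning_positions), 3))
print("execution:", round(mean_entropy(execution_positions), 3))
```

Tracking these two averages across checkpoints is one way to see the described pattern: execution entropy settling while planning entropy rises.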
What you might not have expected: the frontier of "new capability" isn't only about reasoning content but also about *metacognition* and *horizon*. RL has been shown to scale to multi-turn software-engineering tasks, doubling SWE-bench performance in stateful environments with delayed rewards ("Can reinforcement learning scale beyond single-turn language tasks?"), and rewarding the *process* of reasoning (planning, reflection, monitoring) rather than just outcomes cuts wasteful repeated actions by 31% while generalizing better ("Can RL agents learn to reason better, not just succeed?"). In other words, the cleanest way to make RL teach something new may be to reward *how* the model thinks, not just whether it landed the answer.
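As a rough illustration of what "wasteful repeated actions" could mean as a measurable quantity, the sketch below counts the fraction of actions in an episode that exactly repeat an earlier action. This definition and the example episode are assumptions; the notes do not say how the 31% reduction was measured.

```python
def repeated_action_rate(actions):
    """Fraction of actions that exactly repeat an earlier action in the same
    episode. This is an assumed definition, used only for illustration."""
    if not actions:
        return 0.0
    seen = set()
    repeats = 0
    for action in actions:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(actions)

# HYPOTHETICAL agent episode: two of the five actions repeat earlier ones.
episode = ["open door", "look", "look", "open door", "go north"]
print(repeated_action_rate(episode))
```

A process-aware reward could subtract a multiple of this rate from the outcome reward, so the agent is nudged away from loops even in episodes that eventually succeed.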
Sources (12 notes)
For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.
Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.
RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.