INQUIRING LINE

What distinguishes RL that creates new capabilities from RL that merely teaches timing?

This line of inquiry explores the dividing line between RL that genuinely expands what a model can do and RL that just teaches an already-capable model when to fire its reasoning. The corpus turns out to disagree productively about where that line sits.


This question lands on a live fault line in the corpus: one camp says RL only ever reschedules existing skills, the other says it can mint new ones — and the most interesting notes explain *under what conditions* each is true.

The "merely teaches timing" view has strong support. Several notes argue that pre-training already contains the reasoning capability in latent form, and RL just optimizes *when* to deploy it: a hybrid model recovered 91% of the performance gains using only 12% of the tokens by routing alone, and the activation vectors for reasoning strategies existed *before* any RL touched the model (see "Does RL teach reasoning or just when to use it?" and "Does RL post-training create reasoning or just deploy it?"). Pushed further, pass@k analysis shows base models actually *out-sample* RLVR models at high k, meaning RLVR narrows the distribution toward solutions the base model already had rather than unlocking new ones (see "Does RLVR actually expand what models can reason about?"). The mechanism underneath is telling: RL updates are structurally sparse, touching only 5–30% of parameters, and work mostly by *suppressing wrong trajectories* rather than building new ones (see "What actually changes inside a model during RL training?" and "How does RL training reshape reasoning and what gets lost?").
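The pass@k argument is easier to see with numbers. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The per-problem counts are invented for illustration and do not come from any note in the corpus; they only show how a model that is reliable on fewer problems can win at k=1 yet lose at high k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    n = samples drawn per problem, c = samples that were correct.
    """
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL data: correct samples out of N per problem, four problems.
N = 256
BASE_COUNTS = [8, 4, 2, 1]      # rarely right, but covers every problem
RLVR_COUNTS = [200, 180, 0, 0]  # very reliable on two problems, never solves two

for k in (1, 16, 256):
    base = mean_pass_at_k(BASE_COUNTS, N, k)
    rlvr = mean_pass_at_k(RLVR_COUNTS, N, k)
    print(f"k={k:>3}  base={base:.3f}  rlvr={rlvr:.3f}")
```

Under these made-up counts the RLVR-style model leads at k=1 while the base-style model leads at k=256, which is the crossover pattern the note describes: a sharper distribution over a narrower set of solvable problems.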

But the corpus names three concrete conditions that flip RL from optimizer to creator. The first is **task complexity**: for standard reasoning, RL activates latent ability, but for complex multi-step planning it generates genuinely novel strategies the base model can't reach even with heavy sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). The second is **training regime and domain diversity**: prolonged RL with KL control, policy resetting, and *non-mathematical* tasks beats the base model across all pass@k levels; the capability frontier genuinely moves, especially in domains where the base model has no established pattern to fall back on (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The third is **reward verifiability**: binary verifiable rewards drove gains from 0.15% to 73.98%, while fuzzy judgment-based rewards barely moved the needle. Clear signals unlock suppressed capability; blurry ones can't (see "Why does RL succeed more on some tasks than others?").
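Two of these conditions, a binary verifiable reward and KL control, can be illustrated in a few lines. This is a generic sketch, not the method of any paper in the corpus: the function names, the exact-match check, the `beta` value, and the single-sample KL estimate (sum of per-token log-probability differences between the policy and a frozen reference) are all assumptions chosen for illustration.

```python
def binary_reward(predicted: str, reference: str) -> float:
    """Verifiable reward: 1.0 on an exact match after trimming, else 0.0.

    Real verifiers (unit tests, symbolic checkers) are more involved;
    exact match is an assumption made here for brevity.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.01):
    """Subtract a KL penalty from the task reward.

    The KL term is a single-sample estimate: the sum over generated tokens of
    log pi(token) - log pi_ref(token). `beta` is a hypothetical coefficient.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have equal length")
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Hypothetical two-token completion scored against the reference answer "42".
task_reward = binary_reward(" 42 ", "42")
shaped = kl_penalized_reward(task_reward, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(task_reward, shaped)
```

The design point is that the task signal stays unambiguous (right or wrong), while the penalty discourages the policy from drifting far from the reference model. Resetting the reference to a more recent policy, as the "policy resetting" in the note suggests, would amount to swapping the source of `ref_logprobs` partway through training.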

The sharpest synthesis is that "timing" and "capability" aren't rivals but *phases*. RL training reliably runs in two stages: first execution correctness drives learning (procedural consolidation), then strategic planning becomes the bottleneck and planning-token entropy rises (strategic exploration; see "Does RL training follow a predictable two-phase learning sequence?"). So timing-optimization is the early phase; novel strategy is what *can* emerge in the second phase, but only if the task is hard enough and the training long enough to get there. The catch is that scheduling matters too: structured tasks collapse output entropy while creative tasks need it, so training order itself shapes whether open-ended capability survives (see "Does training order reshape how models handle different task types?").
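The "planning-token entropy rises" signal can be made concrete with a small measurement sketch. It computes the Shannon entropy of each next-token distribution and averages it separately for planning and execution tokens. How tokens get labeled as planning or execution is not specified in the notes, so the role labels and the toy distributions below are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(steps):
    """Average entropy per role.

    `steps` is a list of (role, probs) pairs, where role is a label such as
    "planning" or "execution" (the labeling scheme is an assumption).
    """
    totals, counts = {}, {}
    for role, probs in steps:
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}


# HYPOTHETICAL distributions over a four-token vocabulary.
steps = [
    ("planning", [0.25, 0.25, 0.25, 0.25]),   # undecided between strategies
    ("planning", [0.4, 0.3, 0.2, 0.1]),
    ("execution", [0.97, 0.01, 0.01, 0.01]),  # near-deterministic step
    ("execution", [1.0, 0.0, 0.0, 0.0]),
]
print(mean_entropy_by_role(steps))
```

Tracking these two averages across checkpoints is one way to look for the pattern the note describes, with execution entropy flattening out while planning entropy climbs.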

What you might not have expected: the frontier of "new capability" isn't only about reasoning content but about *metacognition* and *horizon*. RL has been shown to scale to multi-turn software-engineering tasks, roughly doubling SWE-bench Verified performance (20% to 39%) in stateful environments with delayed rewards (see "Can reinforcement learning scale beyond single-turn language tasks?"). Separately, rewarding the *process* of reasoning (planning, reflection, monitoring) rather than just outcomes cuts wasteful repeated actions by 31% while generalizing better than supervised fine-tuning alone (see "Can RL agents learn to reason better, not just succeed?"). In other words, the cleanest way to make RL teach something new may be to reward *how* the model thinks, not just whether it landed the answer.


Sources (12 notes)

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RL-in-LLMs researcher. The question remains open: **Under what conditions does RL create genuinely new capabilities versus merely optimizing when to deploy latent ones?**

What a curated library found — and when (dated claims, not current truth):
Findings span April 2025–December 2025. The corpus splits sharply:
• RL often teaches *timing*, not capability: base models out-sample RLVR at high pass@k; sparse updates (5–30% of parameters) suppress wrong trajectories rather than build new ones (2025-04, 2025-07).
• Three conditions flip RL into a creator: (1) task complexity—multi-step planning generates novel strategies inaccessible to base models; (2) training regime—prolonged RL with KL control + domain diversity moves the capability frontier, especially in non-mathematical domains (2025-05); (3) reward verifiability—binary verifiable rewards drove gains from 0.15% to 73.98%, fuzzy judgment-based rewards barely moved the needle (2025-08).
• RL exhibits a two-phase dynamic: procedural consolidation (timing) precedes strategic exploration (novelty); whether new capability survives depends on task structure and training order (2025-07, 2025-10).
• RL extends to horizon and metacognition: multi-turn RL roughly doubled SWE-bench Verified performance (20% to 39%), and, separately, rewarding the *process* of reasoning rather than just outcomes cut wasteful repeated actions by 31% (2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-04) — base model pass@k ceiling vs. RLVR
• arXiv:2505.24864 (2025-05) — prolonged RL and frontier expansion
• arXiv:2507.14783 (2025-07) — multi-task RL entropy dynamics
• arXiv:2508.03501 (2025-08) — long-horizon software engineering with RL

Your task:
(1) RE-TEST THE SPLIT. For each claim (timing vs. capability), check whether newer models (o3, o4, future versions), better reward design (rubric anchors, process-based), or orchestration (multi-agent, hierarchical planning, caching) have since RELAXED or OVERTURNED the boundary. Does the two-phase dynamic still hold? Has verifiable-reward advantage grown or shrunk? Cite what shifted each constraint, and name plainly where it still appears to hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months—especially papers that claim RL *does* unlock capability beyond latent (or vice versa) in ways the library missed, or that show the timing/capability split is a false dichotomy.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *If* process-based rewards now reliably unlock novelty, can we predict which tasks will respond? *If* RL's two phases are sequential, can we shortcut procedural consolidation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
