INQUIRING LINE

What training duration is actually needed for RL to expand capabilities?

This explores how long RL training has to run before it stops merely sharpening what a base model already knows and starts producing genuinely new reasoning — and the corpus suggests duration only matters in combination with task type and training recipe.


This is really a question about *when* RL crosses from polishing to creating, and the corpus says duration alone isn't the lever. The starting point is a genuine disagreement. One camp finds that RLVR never expands what a model can do: pass@k analysis shows base models eventually match or beat RL-trained ones at high k (many samples per problem), meaning RL just concentrates probability on solutions already latent in the base distribution (see "Does RLVR actually expand what models can reason about?" and "How does RL training reshape reasoning and what gets lost?"). In that view, RL teaches *when* to deploy reasoning, not *how*: hybrid models recover 91% of the gains by routing tokens alone, and the reasoning activation vectors exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?").
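The pass@k comparison behind this disagreement can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator (draw n samples per problem, count c correct, estimate the chance that at least one of k draws is correct). The per-problem counts below are hypothetical and chosen only to show how an RL-trained model can win at k=1 yet lose at high k; they are not taken from any of the cited studies.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k draws must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given each problem's correct count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: two problems, 256 samples each.
N_SAMPLES = 256
base_correct = [2, 4]    # base model: rarely right, but never at zero
rl_correct = [200, 0]    # RL model: sharp on problem 1, never solves problem 2

for k in (1, 16, 256):
    base = mean_pass_at_k(base_correct, N_SAMPLES, k)
    rl = mean_pass_at_k(rl_correct, N_SAMPLES, k)
    print(f"k={k:>3}  base={base:.3f}  rl={rl:.3f}")
```

In this toy setup the RL model leads at k=1 because it concentrates probability on one solvable problem, while the base model overtakes it at k=256 because it still occasionally solves both. That crossover is the signature the "RL only sharpens" camp points to; a model that expanded its capabilities would lead at every k.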

But the most direct answer to the duration question comes from the work showing that *prolonged* RL does break the boundary, under specific conditions. Long training on diverse, non-mathematical tasks, with KL control and policy resetting, produces models that beat the base model at *every* pass@k level, not just low k (see "Can reinforcement learning discover reasoning strategies base models cannot?"). The reconciliation between the two camps is conditional: for standard reasoning, RL activates latent ability quickly; for complex multi-step planning, extended RL generates strategies the base model can't reach even with massive sampling (see "Does reinforcement learning create new reasoning abilities or activate existing ones?"). So 'how long' really means 'long enough to get past procedural mastery into strategic exploration.'

That phrase isn't loose; there's a measurable shape to it. RL training moves through two phases: first it consolidates execution correctness, then strategic planning becomes the bottleneck, with planning-token entropy rising while execution entropy stabilizes (see "Does RL training follow a predictable two-phase learning sequence?"). Capability expansion, if it happens, lives in that second phase, so training that stops after phase one will look exactly like 'RL only activates latent skills,' because at that point it does. Underneath, the mechanism is sparse: RL updates only 5–30% of parameters, mostly by suppressing wrong trajectories rather than amplifying right ones (see "What actually changes inside a model during RL training?").
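One way to watch for the second phase is to track mean next-token entropy separately for planning and execution tokens across checkpoints. The sketch below is a minimal illustration under stated assumptions: the "planning" and "execution" labels are assumed to come from some external tagging step (the text above does not say how tokens are classified), and the probability distributions are hypothetical toy values rather than model outputs.

```python
import math


def token_entropy(probs) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_class(steps) -> dict:
    """Average entropy per token class.

    steps: iterable of (label, probs) pairs, where label is a hypothetical
    tag such as "planning" or "execution".
    """
    totals, counts = {}, {}
    for label, probs in steps:
        totals[label] = totals.get(label, 0.0) + token_entropy(probs)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


# Hypothetical checkpoints: execution entropy holds steady,
# planning entropy rises as the policy explores more strategies.
early = [("planning", [0.9, 0.1]), ("execution", [0.9, 0.1])]
late = [("planning", [0.5, 0.5]), ("execution", [0.9, 0.1])]

print("early:", mean_entropy_by_class(early))
print("late: ", mean_entropy_by_class(late))
```

If a run ends while both curves are still flat or falling, it has plausibly stopped inside phase one, which is the regime where RL looks like pure activation.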

The encouraging part for anyone budgeting compute: the trajectory is predictable. A 400K GPU-hour study across 200+ models found RL performance scales sigmoidally, where the *recipe* sets the ceiling and implementation details only affect how fast you climb, meaning you can extrapolate the asymptote from small runs rather than discovering it the expensive way (see "Does RL training follow predictable scaling curves?"). Duration buys you progress along the curve; recipe decides where the curve tops out.
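The ceiling-versus-speed distinction is easier to see with a concrete saturating curve. The sketch below assumes one common sigmoidal form in compute, with a floor, a ceiling, a midpoint, and a slope; the exact parameterization used by the study is not given here, and every number is hypothetical. The "recipe" is modeled as the ceiling and "implementation efficiency" as the midpoint.

```python
def sigmoid_curve(compute: float, ceiling: float, floor: float,
                  midpoint: float, slope: float) -> float:
    """Assumed saturating form: performance rises from floor toward ceiling.

    At compute == midpoint the curve is exactly halfway between floor and
    ceiling; as compute grows, it approaches ceiling.
    """
    return floor + (ceiling - floor) / (1.0 + (midpoint / compute) ** slope)


# Hypothetical runs (compute in arbitrary units).
runs = {
    "recipe A, efficient":   dict(ceiling=0.60, floor=0.20, midpoint=10.0, slope=1.0),
    "recipe A, inefficient": dict(ceiling=0.60, floor=0.20, midpoint=100.0, slope=1.0),
    "recipe B, efficient":   dict(ceiling=0.45, floor=0.20, midpoint=10.0, slope=1.0),
}

for name, params in runs.items():
    row = [round(sigmoid_curve(c, **params), 3) for c in (1, 10, 100, 10_000)]
    print(f"{name:<22} {row}")
```

In this toy setup the two recipe A runs converge to the same plateau at large compute, differing only in how quickly they get there, while recipe B plateaus lower no matter how long it trains.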

The quieter lesson is that 'duration' is the wrong sole variable. What you train on matters as much as how long. Gains track reward verifiability: binary checkable rewards jump from near-zero to ~74%, while fuzzy judgment-based ones barely move (see "Why does RL succeed more on some tasks than others?"). Training *order* shapes whether entropy collapses and kills open-ended ability (see "Does training order reshape how models handle different task types?"). And RL now demonstrably scales to long-horizon multi-turn tasks like software engineering, doubling SWE-bench Verified from 20% to 39%, which only became practical once asynchronous training let generation and learning run without waiting on each other (see "Can reinforcement learning scale beyond single-turn language tasks?" and "Can RL training run while generation continues without waiting?"). So the honest answer to 'how long' is: long enough to reach the strategic-exploration phase, on verifiable and complex-enough tasks, with a recipe whose ceiling is worth the climb. Short runs on simple tasks will only ever activate what's already there.


Sources (12 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

What actually changes inside a model during RL training?

RL's effects concentrate in structurally sparse but full-rank subnetworks across multiple algorithms and models. Suppressing wrong trajectories—rather than amplifying correct ones—appears to be the primary mechanism, with training following a predictable two-phase pattern of procedural consolidation then strategic exploration.

Does RL training follow predictable scaling curves?

Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.

Why does RL succeed more on some tasks than others?

Binary verifiable rewards enable dramatic RL gains (0.15% to 73.98%), while judgment-based evaluation yields modest improvements (55% reduction). Clear reward signals unlock suppressed capabilities; fuzzy signals barely move the needle.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Can reinforcement learning scale beyond single-turn language tasks?

Modified DAPO training doubled SWE-bench Verified performance from 20% to 39% on Qwen2.5-72B, matching larger models. This demonstrates RL works in stateful multi-step environments with delayed rewards and complex feedback, beyond theoretical single-turn MDPs.

Can RL training run while generation continues without waiting?

AReaL enables continuous generation across workers while training runs on mixed model versions using modified PPO. The system achieves high GPU utilization and handles stale samples effectively, making multi-turn RL practical.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability analyst. The question remains open: **What training duration is actually needed for RL to expand capabilities in LLMs—or does it only activate latent ones?**

What a curated library found — and when (dated claims, not current truth):
Findings span April 2025–December 2025. A curated library identified a live tension:
• RLVR may *not* expand boundaries: base models eventually match RL-trained ones at high pass@k; RL concentrates probability on solutions already latent (2025-04, 2025-10).
• *But* prolonged RL on diverse, non-mathematical tasks with KL control *does* beat base at every pass@k—genuine novel strategies (2025-05).
• The reconciliation is conditional: standard reasoning activates latent ability quickly; complex multi-step planning requires extended RL to escape base model's reach (2025-05).
• RL training exhibits two phases—procedural consolidation, then strategic planning (entropy rise)—and capability expansion lives in phase two only (2025-05, 2025-07).
• RL updates only 5–30% of parameters; performance scales sigmoidally, with the recipe setting the ceiling, not duration alone (2025-10, 2025-12).
• Task reward *type* matters as much as duration: binary-verifiable rewards jump from near-zero to ~74%; fuzzy judgment-based rewards stall (2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (Apr 2025) – Does RL Really Incentivize Reasoning Beyond Base?
• arXiv:2505.24864 (May 2025) – ProRL: Prolonged RL Expands Reasoning Boundaries
• arXiv:2510.13786 (Oct 2025) – Scaling RL Compute for LLMs
• arXiv:2512.07783 (Dec 2025) – Interplay of Pre-, Mid-, and RL on Reasoning

Your task:
(1) **RE-TEST each constraint.** For the latent-activation vs. genuine-expansion split, judge whether models released in the last 6 months (Dec 2025 onward), new RL methods (e.g., negative RL, rubric anchors), or better multi-task scheduling have *shifted the boundary*, i.e., does RL now expand on tasks it previously only activated? Separate the durable question (how long is "long enough"?) from the perishable limitation (which task classes respond to extension). Cite what resolved it.
(2) **Surface the strongest contradicting or superseding work** from the last 6 months (July–December 2025). Pay special attention to Dec 2025 work on the interplay of pre-training, mid-training, and RL, and to any evidence that phase structure is model-dependent.
(3) **Propose 2 research questions** that assume the regime *has* moved: e.g., "Does the phase-two bottleneck shift as base models improve?" or "Can asynchronous RL (2025-05) collapse the duration gap for complex tasks?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
