INQUIRING LINE

AI gets better at reasoning through trial-and-error training — until it stops exploring new approaches, and that's where progress stalls.

What limits RL's ability to scale for reasoning at training time?

This explores the ceilings RL hits when used to improve reasoning during training — why performance plateaus, what mechanisms cause it, and whether those limits are fundamental or just artifacts of how we run RL today.

The corpus converges on a striking answer: the main limit isn't compute or data, it's *exploration collapse*. RL keeps narrowing toward what already works and stops searching. The clearest version is the idea of a predictable performance ceiling tied to entropy. As training proceeds, the model's policy entropy falls, and once it approaches zero the model stops trying new things; one study fits this to a clean empirical law in which reasoning performance saturates as exploration dies out (see "Does policy entropy collapse limit reasoning performance in RL?"). A closely related framing calls this "capability boundary collapse": because RLVR trains on the model's own outputs, it rewards exploitation over exploration and can actually *shrink* the range of problems a model can solve (see "Why does RLVR training narrow a model's problem solving ability?").
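To make "policy entropy" concrete, here is a minimal sketch of how it might be measured, assuming we have the model's next-token probability distributions at a few sampled positions. The function names and the example distributions are hypothetical illustrations, not taken from any of the cited studies.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    # Zero-probability tokens contribute nothing, so skip them to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_policy_entropy(step_distributions):
    """Average per-token entropy over the sampled positions."""
    total = sum(token_entropy(d) for d in step_distributions)
    return total / len(step_distributions)

# Hypothetical distributions: early in training the policy spreads probability
# over several continuations; late in training it is almost deterministic.
early = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
late = [[0.97, 0.01, 0.01, 0.01], [1.0, 0.0, 0.0, 0.0]]

early_entropy = mean_policy_entropy(early)
late_entropy = mean_policy_entropy(late)
```

A falling value of `mean_policy_entropy` over training steps is the signal that the policy is concentrating on a few continuations, which is what the paragraph above describes as exploration dying out.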

There's a deeper, more unsettling claim underneath this: maybe RL isn't expanding reasoning at all. Several notes argue base models already contain reasoning ability in latent form, and that post-training merely *selects* and *times* it rather than creating it (see "Do base models already contain hidden reasoning ability?"). One reframes RL post-training as a deployment problem (teaching the model *when* to reason, not *how*), pointing out that hybrid models recover most of the gains just by routing tokens (see "Does RL post-training create reasoning or just deploy it?"). The most pointed evidence: RLVR mostly improves sampling efficiency inside existing boundaries, a single training example can trigger the effect, and for models with appropriate pretraining even *spurious* rewards work nearly as well as correct ones, which is hard to square with RL teaching genuinely new skills (see "What does reward learning actually do to model reasoning?"). If RL only re-weights what's already there, the scaling ceiling is the base model itself.

But the corpus doesn't let that conclusion stand unchallenged, and this is where it gets interesting. Prolonged RL on *diverse, non-mathematical* tasks (with KL control and policy resetting) does beat base models at every pass@k level, suggesting RL can push past the base model's boundaries when the domain lacks pre-baked patterns and exploration is actively protected (see "Can reinforcement learning discover reasoning strategies base models cannot?"). So the limit may be less a hard wall than a consequence of how narrowly we usually train. The disagreement in the literature seems to hinge on task diversity and whether the recipe deliberately preserves exploration.
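One way to picture "KL control and policy resetting" is the sketch below. It is an illustration under stated assumptions, not the cited study's implementation: it assumes KL control means subtracting a scaled KL divergence (policy versus a frozen reference) from the task reward, and that resetting means periodically replacing the reference with a snapshot of the current policy. The names `beta`, `reset_every`, and both helper functions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    Assumes q is nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_penalized_reward(task_reward, policy_probs, reference_probs, beta):
    """Task reward minus a penalty for drifting away from the reference."""
    return task_reward - beta * kl_divergence(policy_probs, reference_probs)

def maybe_reset_reference(step, reset_every, policy_probs, reference_probs):
    """Assumed reset rule: every `reset_every` steps, re-anchor on the policy."""
    if step > 0 and step % reset_every == 0:
        return list(policy_probs)
    return reference_probs

# Hypothetical two-token example.
reference = [0.5, 0.5]
policy = [0.25, 0.75]
shaped = kl_penalized_reward(1.0, policy, reference, beta=0.1)
reference = maybe_reset_reference(200, 100, policy, reference)
```

Under these assumptions the penalty discourages the policy from collapsing far from its anchor within a phase, while the reset lets the anchor move so that long runs are not pinned to the original base model.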

Scale, timing, and model size add three more limiting factors that don't reduce to exploration. RL scaling follows sigmoid curves whose *asymptote is set by the recipe*, not the implementation, meaning your training recipe quietly decides the ceiling before you even start, and implementation tweaks only change how fast you reach it (see "Does RL training follow predictable scaling curves?"). Training also moves through phases: first execution correctness is the bottleneck, then strategic planning becomes the thing that's hard to improve, so a single uniform RL pressure stops paying off as the binding constraint shifts (see "Does RL training follow a predictable two-phase learning sequence?"). And model size matters in a way that can fool you: small models under RL can reach accuracy comparable to larger ones through shortcut learning that produces no real, transferable reasoning, a collapse that's invisible unless you inspect the reasoning traces themselves (see "Does reinforcement learning on theory of mind collapse with model scale?").

The constructive thread across all this: the limits are mostly about exploration starving, the wrong reward signal, and the binding constraint moving mid-training, and each has a counter-move. Entropy-preserving interventions keep the policy curious. Rewarding metacognition (planning, reflection, monitoring) rather than just outcomes teaches *how* to reason instead of which answer to land on (see "Can RL agents learn to reason better, not just succeed?"). And clever reuse of training statistics, like cross-rollout variance doing double duty as both reward signal and query filter, buys stability and speed on the unverifiable tasks where RL usually struggles (see "Can one statistical measure serve dual purposes in RL training?"). What you didn't know you wanted to know: the field's central debate over RL's ceiling is really a debate over whether "reasoning" is being *created* or merely *elicited*, and the answer appears to depend largely on how much room you leave the model to explore.

Sources 11 notes

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
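The law quoted above is linear in exp(H), so its coefficients can be estimated with ordinary least squares and the ceiling read off at H = 0, where R = b - a. The following is a minimal sketch of that arithmetic; the coefficient values and entropy readings are hypothetical, chosen only so the fit can be checked, and the helper names are not from the cited work.

```python
import math

def predicted_performance(entropy, a, b):
    """R = -a * exp(H) + b, the empirical form quoted in the note."""
    return -a * math.exp(entropy) + b

def performance_ceiling(a, b):
    """Value of the law at H = 0: exp(0) = 1, so R = b - a."""
    return predicted_performance(0.0, a, b)

def fit_law(entropies, scores):
    """Least-squares fit of R = -a * x + b with x = exp(H)."""
    xs = [math.exp(h) for h in entropies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, scores))
    slope = sxr / sxx            # slope of R against exp(H), equal to -a
    return -slope, mean_r - slope * mean_x

# Hypothetical, noise-free checkpoints generated from a = 0.2, b = 0.9.
entropies = [1.0, 0.6, 0.3, 0.1]
scores = [predicted_performance(h, 0.2, 0.9) for h in entropies]
a_hat, b_hat = fit_law(entropies, scores)
ceiling = performance_ceiling(a_hat, b_hat)
```

Read this way, the law says that once entropy is nearly spent, the remaining headroom is roughly the gap between the current score and `b - a`, which is why entropy-preserving interventions matter for the final ceiling and not only for training speed.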

Why does RLVR training narrow a model's problem solving ability?

RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
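Claims about "all pass@k levels" rest on how pass@k is computed. A minimal sketch follows, using the standard unbiased estimator from the code-generation evaluation literature: with n samples per problem and c of them correct, pass@k = 1 - C(n-c, k) / C(n, k). The per-problem counts below are hypothetical and exist only to show how a model can gain at pass@1 while losing at large k.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Hypothetical counts of correct answers out of 64 samples on two problems.
base_counts = [4, 1]   # base model: rarely right, but solves both eventually
rl_counts = [32, 0]    # RL model: reliable on one problem, never solves the other

base_p1 = mean_pass_at_k(base_counts, 64, 1)
rl_p1 = mean_pass_at_k(rl_counts, 64, 1)
base_p64 = mean_pass_at_k(base_counts, 64, 64)
rl_p64 = mean_pass_at_k(rl_counts, 64, 64)
```

In this made-up case the RL model wins at k = 1 but loses at k = 64, the pattern associated with better sampling efficiency inside a narrower boundary. The finding summarized above is notable precisely because the RL-trained model wins at every k.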

Does RL training follow predictable scaling curves?

Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.
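The split between "ceiling" and "efficiency" can be illustrated with a saturating curve in compute. The note only says the curve is sigmoidal, so the specific functional form below, and every parameter name and value, is an assumption made for illustration: `asymptote` stands in for the recipe-determined ceiling, while `c_mid` and `steepness` stand in for efficiency.

```python
def sigmoid_scaling(compute, r0, asymptote, c_mid, steepness):
    """Assumed form: R(C) = r0 + (A - r0) / (1 + (c_mid / C) ** steepness)."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** steepness)

# Hypothetical recipe with ceiling 0.6, run with two implementations that
# differ only in efficiency (how much compute it takes to get halfway there).
def fast_impl(compute):
    return sigmoid_scaling(compute, r0=0.1, asymptote=0.6, c_mid=1e3, steepness=1.5)

def slow_impl(compute):
    return sigmoid_scaling(compute, r0=0.1, asymptote=0.6, c_mid=4e3, steepness=1.5)

# A different hypothetical recipe with a higher ceiling.
def better_recipe(compute):
    return sigmoid_scaling(compute, r0=0.1, asymptote=0.7, c_mid=4e3, steepness=1.5)

budgets = [1e2, 1e3, 1e4, 1e6]
fast_curve = [fast_impl(c) for c in budgets]
slow_curve = [slow_impl(c) for c in budgets]
```

Under this assumed form, `fast_impl` and `slow_impl` converge to the same value at large compute, and only `better_recipe` ends up higher. Fitting such a curve on small runs is also what would make extrapolation to larger budgets possible, provided the recipe is stable.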

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-systems analyst tracking RL scaling constraints for LLMs. The core question remains open: what are the *fundamental* limits on reasoning improvement via RL at training time—and are they intrinsic to the method or artifacts of current practice?

What a curated library found—and when (dated claims, not current truth):
Findings span Feb 2025–Oct 2025; treat these as perishable snapshots:
• Policy entropy collapse is the primary bottleneck: as exploration entropy → 0, reasoning performance saturates, following a predictable empirical law (2025-05, arXiv:2505.22617).
• RL often exhibits "capability boundary collapse"—training on the model's own outputs shrinks rather than expands problem-solving scope (2025-07, arXiv:2508.00222).
• Base models may already contain latent reasoning; RL merely *elicits* and *times* it rather than creating novel capability—single examples trigger gains, even spurious rewards work (2025-10, arXiv:2510.07364).
• Prolonged RL on diverse, non-mathematical tasks with KL control and policy resetting *does* beat base models at every pass@k, suggesting exploration-preserving recipes can overcome elicitation-only ceilings (2025-05, arXiv:2505.24864).
• RL scaling follows sigmoid curves whose asymptote is set by the training recipe itself, not implementation; small models hit shortcut reasoning indistinguishable from genuine reasoning without trace inspection (2025-04, arXiv:2504.01698).

Anchor papers (verify; mind their dates):
• arXiv:2505.22617 (May 2025): entropy mechanism as bottleneck
• arXiv:2510.07364 (Oct 2025): elicitation-vs.-creation framing
• arXiv:2505.24864 (May 2025): prolonged RL breaking boundaries
• arXiv:2508.00222 (Jul 2025): capability collapse and hybrid remedies

Your task:
(1) RE-TEST EACH CONSTRAINT. For entropy collapse, policy resetting, and task-diversity effects: do Nov 2025–Jan 2026 preprints report larger exploration windows, entropy-preserving techniques, or curricula that shift the asymptote? Does the elicitation-vs.-creation debate hold, or have recent model-introspection methods settled it? Separate durable open questions (e.g., "how much real reasoning vs. selection?") from perishable limitations (e.g., "entropy always collapses") and cite what resolved them.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months—papers that claim RL *does* scale reasoning without these ceilings, or that reframe them entirely (e.g., compute-optimal schedules, orchestration tricks, new reward formulations).
(3) Propose 2 research questions that assume the regime has moved: (a) If exploration can be preserved, what *other* bottleneck emerges? (b) Can we design RL recipes that deliberately separate "eliciting base reasoning" from "extending it," and does that unlock higher ceilings?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
