What limits RL's ability to scale for reasoning at training time?
This explores the ceilings RL hits when used to improve reasoning during training — why performance plateaus, what mechanisms cause it, and whether those limits are fundamental or just artifacts of how we run RL today.
The corpus converges on a striking answer: the main limit isn't compute or data, it's *exploration collapse*. RL keeps narrowing toward what already works and stops searching. The clearest version is the idea of a predictable performance ceiling tied to entropy. As training proceeds, the model's policy entropy falls, and once it approaches zero the model stops trying new things; one study fits this to a clean empirical law in which reasoning performance saturates as exploration dies out (Does policy entropy collapse limit reasoning performance in RL?). A closely related framing calls this "capability boundary collapse": because RLVR (reinforcement learning with verifiable rewards) trains on the model's own outputs, it rewards exploitation over exploration and can actually *shrink* the range of problems a model can solve (Why does RLVR training narrow a model's problem solving ability?).
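To make the entropy ceiling concrete, here is a minimal sketch of the fitted form quoted in the source notes, R = -a·exp(H) + b, where H is policy entropy and R is reasoning performance. The coefficient values and the function names are hypothetical, chosen only for illustration; the source does not report fitted values. The point is that as H approaches zero, exp(H) approaches 1, so R can rise no further than b - a.

```python
import math


def token_entropy(probs):
    """Shannon entropy, in nats, of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_performance(entropy, a, b):
    """Fitted form quoted in the source notes: R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b


def performance_ceiling(a, b):
    """Value of R as H approaches 0. Since exp(0) = 1, this is b - a."""
    return b - a


# Hypothetical coefficients for illustration only, not fitted values.
A_COEF, B_COEF = 0.2, 0.9

# As entropy falls, predicted performance climbs toward the ceiling and stops.
for h in (1.0, 0.5, 0.1, 0.0):
    r = predicted_performance(h, A_COEF, B_COEF)
    print(f"H = {h:.1f}  ->  predicted R = {r:.3f}")
print(f"ceiling = {performance_ceiling(A_COEF, B_COEF):.3f}")
```

Under this form, entropy behaves like a budget that is spent to buy performance: once it is gone, further training has nothing left to trade, which is why the entropy-preserving interventions named in the sources aim to slow the spending rather than raise the reward.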
There's a deeper, more unsettling claim underneath this: maybe RL isn't expanding reasoning at all. Several notes argue base models already contain reasoning ability in latent form, and that post-training merely *selects* and *times* it rather than creating it (Do base models already contain hidden reasoning ability?). One reframes RL post-training as a deployment problem, teaching the model *when* to reason rather than *how*, and points out that hybrid models recover 91% of the gains just by routing tokens (Does RL post-training create reasoning or just deploy it?). The most pointed evidence: RLVR mostly improves sampling efficiency inside existing boundaries, a single training example can trigger the effect, and even *spurious* rewards work nearly as well as correct ones for models with appropriate pretraining, which is hard to square with RL teaching genuinely new skills (What does reward learning actually do to model reasoning?). If RL only re-weights what's already there, the scaling ceiling is the base model itself.
But the corpus doesn't let that conclusion stand unchallenged, and this is where it gets interesting. Prolonged RL on *diverse, non-mathematical* tasks, with KL control and policy resetting, does beat base models at every pass@k level, suggesting RL can push past the base model's boundaries when the domain lacks pre-baked patterns and exploration is actively protected (Can reinforcement learning discover reasoning strategies base models cannot?). So the limit may be less a hard wall than a consequence of how narrowly we usually train. The disagreement in the literature seems to hinge on task diversity and on whether the recipe deliberately preserves exploration.
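Because the whole dispute is argued in terms of pass@k, it helps to see how that metric is typically computed. The sketch below uses the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct; the source notes do not say which estimator the cited studies used, so treat this as an assumption. The sample counts are hypothetical and stand in for no real model.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Probability that at least one of k samples drawn without replacement
    from the n generated samples is correct.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem, 100 samples per model.
N_SAMPLES = 100
base_correct, rl_correct = 2, 30

for k in (1, 8, 64):
    base = pass_at_k(N_SAMPLES, base_correct, k)
    tuned = pass_at_k(N_SAMPLES, rl_correct, k)
    print(f"k = {k:>2}: base = {base:.3f}, RL-tuned = {tuned:.3f}")
```

The two readings of RL differ in what happens at large k. If RL only sharpens sampling, the base model catches up (or overtakes) as k grows, because its rare correct samples eventually show up. If RL expands the boundary, the RL-tuned model stays ahead at every k, including on problems where the base model's correct count is zero.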
Scale, timing, and model size add three more limiting factors that don't reduce to exploration. RL scaling follows sigmoid curves whose *asymptote is set by the recipe*, not the implementation. In other words, your training recipe quietly decides the ceiling before you even start, and implementation tweaks only change how fast you reach it (Does RL training follow predictable scaling curves?). Training also moves through phases: first execution correctness is the bottleneck, then strategic planning becomes the thing that's hard to improve, so a single uniform RL pressure stops paying off as the binding constraint shifts (Does RL training follow a predictable two-phase learning sequence?). And model size matters in a way that can fool you: small models under RL can hit the same accuracy as larger ones through shortcut learning that produces no real, transferable reasoning, a collapse that's invisible unless you inspect the reasoning traces themselves (Does reinforcement learning on theory of mind collapse with model scale?).
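The recipe-versus-implementation distinction is easier to see with a curve in hand. The sketch below assumes one common saturating form, R(C) = R0 + (A - R0) / (1 + (C_mid / C)^B), where C is training compute, A is the asymptote, C_mid is the compute at which half the total gain is reached, and B controls steepness. The source notes say only that scaling is sigmoidal, so this parameterization, the function name, and every number are assumptions for illustration.

```python
def sigmoid_scaling(compute, r0, asymptote, c_mid, slope):
    """Assumed saturating curve: r0 + (A - r0) / (1 + (c_mid / C) ** slope)."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** slope)


# Hypothetical settings. "Recipe" changes move the asymptote;
# "implementation" changes move only c_mid (how fast the ceiling is reached).
R0 = 0.30
recipe_a_slow = dict(r0=R0, asymptote=0.60, c_mid=4000.0, slope=1.0)
recipe_a_fast = dict(r0=R0, asymptote=0.60, c_mid=1000.0, slope=1.0)
recipe_b_slow = dict(r0=R0, asymptote=0.75, c_mid=4000.0, slope=1.0)

for compute in (1e3, 1e4, 1e5, 1e6):
    row = [
        sigmoid_scaling(compute, **cfg)
        for cfg in (recipe_a_slow, recipe_a_fast, recipe_b_slow)
    ]
    print(f"C = {compute:>9.0f}: " + ", ".join(f"{r:.3f}" for r in row))
```

In this toy setup the faster implementation of recipe A leads early but converges to the same ceiling as the slow one, while recipe B starts level with the slow run and ends clearly higher. That is the practical reading of the claim: efficiency work changes C_mid, and only recipe choices change A.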
The constructive thread across all this: the limits mostly come from exploration starving, the wrong reward signal, and the binding constraint moving mid-training, and each has a counter-move. Entropy-preserving interventions keep the policy curious. Rewarding metacognition (planning, reflection, monitoring) rather than just outcomes teaches *how* to reason instead of which answer to land on (Can RL agents learn to reason better, not just succeed?). And clever reuse of training statistics, such as cross-rollout variance doing double duty as both reward signal and query filter, buys stability and speed on the unverifiable tasks where RL usually struggles (Can one statistical measure serve dual purposes in RL training?). What you didn't know you wanted to know: the field's central debate over RL's ceiling is really a debate over whether "reasoning" is being *created* or merely *elicited*, and the answer appears to depend entirely on how much room you leave the model to explore.
Sources (11 notes)
Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.
RLVR narrows models' problem-solving scope by prioritizing exploitation over exploration, a phenomenon called capability boundary collapse. Multiple importance sampling with exploration-based advantage functions can counteract this by integrating external data and explicitly rewarding discovery of underexplored but valuable reasoning paths.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.
Large-scale study (400K GPU-hours, 200+ models) shows RL performance scales sigmoidally. Recipe choices set the ceiling; implementation details only affect efficiency. Stable recipes enable reliable extrapolation from small runs.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.