When does reinforcement learning actually produce true reasoning gains in models?
This explores the conditions under which RL genuinely extends a model's reasoning ability — versus just making it better at finding answers it already knew — and what the corpus says separates the two.
The corpus is unusually consistent on the default case: most of the time, RL does not teach new reasoning at all. Pass@k analysis shows that base models outperform their RL-trained counterparts at high k, that is, when each model is allowed many samples per problem. This means RL narrows the model toward solutions already in its distribution rather than discovering new ones (see "Does RLVR actually expand what models can reason about?"). Reinforcement learning with verifiable rewards (RLVR) works more like a catalyst that surfaces existing capability than a teacher that builds it (see "How does RL training reshape reasoning and what gets lost?" and "What does reward learning actually do to model reasoning?"). The most striking evidence: a single training example can be enough to 'activate' the behavior, and even spurious or random rewards work nearly as well as correct ones for models with appropriate pretraining, which only makes sense if the reasoning was already there waiting to be elicited (see "Do base models already contain hidden reasoning ability?").
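The pass@k comparison can be made concrete with a short sketch. It uses the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), for n sampled answers of which c are correct. The per-problem correct counts are invented for illustration and are not taken from any of the cited studies; they only show how a model that is sharper at k=1 can still cover fewer problems at high k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    without replacement from n generations containing c correct ones, is correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct answers out of n=100 samples on four problems.
base_counts = [30, 10, 2, 1]  # broad coverage, low per-sample reliability
rl_counts = [95, 80, 0, 0]    # high reliability, but two problems never solved

for k in (1, 64):
    base = mean_pass_at_k(base_counts, 100, k)
    rl = mean_pass_at_k(rl_counts, 100, k)
    print(f"k={k}: base={base:.3f} rl={rl:.3f}")
```

In this toy setup the RL-style counts win at k=1 and the base-style counts win at k=64, which is the crossover pattern the pass@k argument relies on.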
So what flips RL from refinement into real gain? One controlled study gives the sharpest answer: RL produces true capability gains only when two conditions hold together. Pretraining has to have already planted the reasoning primitives, and the RL training data has to target tasks right at the edge of what the model can currently do. Miss either, and RL just re-weights sampling (see "When does RL actually extend reasoning beyond pretraining?"). Put differently, RL is a deployment optimizer, not a capability creator: it teaches the model *when* to fire its reasoning machinery, not *how* to reason. One hybrid setup, which combined a base model's reasoning with steering from a thinking model, recovered 91% of the performance gains using just 12% of the tokens, which is what you would expect if RL's job is timing and efficiency rather than new skill (see "Does RL teach reasoning or just when to use it?").
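A minimal sketch of the "edge of competence" condition, assuming you already have an estimated base-model pass rate per task. The function name, the bucket labels, and the 0.05 and 0.6 thresholds are hypothetical choices for illustration; the controlled study is not described here as using these values.

```python
def bucket_tasks(pass_rates: dict, low: float = 0.05, high: float = 0.6) -> dict:
    """Split tasks by base-model pass rate (thresholds are illustrative assumptions).

    mastered:     rate > high, so RL would mostly re-weight existing solutions
    edge:         low <= rate <= high, occasionally solved and a candidate for RL data
    out_of_reach: rate < low, where correct samples are too rare to give a signal
    """
    buckets = {"mastered": [], "edge": [], "out_of_reach": []}
    for task_id, rate in pass_rates.items():
        if rate > high:
            buckets["mastered"].append(task_id)
        elif rate >= low:
            buckets["edge"].append(task_id)
        else:
            buckets["out_of_reach"].append(task_id)
    return buckets


# Hypothetical pass rates measured by sampling the base model.
rates = {"task_a": 0.95, "task_b": 0.30, "task_c": 0.0, "task_d": 0.05}
print(bucket_tasks(rates)["edge"])
```

Only the "edge" bucket would be kept as RL training data under this reading; the other two buckets correspond to the cases where RL is expected to refine sampling rather than extend capability.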
There's a dissenting thread worth weighing against this consensus. Some work argues that with simple accuracy rewards alone, sophisticated domain reasoning can *emerge*: medical systems and models like o3 develop complex problem-solving from difficult problems without any chain-of-thought distillation from a teacher (see "Can simple rewards alone teach complex domain reasoning?"). The likely reconciliation is the edge-of-competence condition again, which can be thought of as 'headroom': emergence happens when the difficulty of the problems keeps pushing the model past comfortable territory, so the reward is doing real work rather than rubber-stamping easy wins.
The more interesting frontier is changing *what* the reward measures. Outcome-only rewards leave a lot on the table. Rewarding the reasoning process itself, by tagging planning, exploration, reflection, and monitoring steps and scoring them programmatically, cuts repetitive actions by 31% relative to outcome-only methods and generalizes better than supervised fine-tuning (see "Can RL agents learn to reason better, not just succeed?"). Using the model's own answer confidence as the reward signal strengthens step-by-step reasoning while reversing the calibration damage that RLHF usually causes (see "Can model confidence work as a reward signal for reasoning?"). And rewarding explanation quality, not just token-level correctness, lets RL embed domain knowledge more durably than SFT (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?").
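To illustrate what "scoring tagged reasoning steps programmatically" could look like, here is a toy reward function. It is not the reward used by any cited method: the trajectory format, the `<action>` tag, and the bonus and penalty values are all assumptions made up for this sketch. It adds a small bonus for each non-empty tagged thought and a penalty whenever an action repeats an earlier one.

```python
import re

# Hypothetical trajectory format: a tagged thought followed by an action.
STEP_RE = re.compile(
    r"<(planning|exploration|reflection|monitoring)>(.*?)</\1>\s*<action>(.*?)</action>",
    re.DOTALL,
)


def process_reward(trajectory: str, success: bool,
                   tag_bonus: float = 0.1, repeat_penalty: float = 0.2,
                   outcome_reward: float = 1.0) -> float:
    """Outcome reward plus process shaping (all weights are illustrative)."""
    reward = outcome_reward if success else 0.0
    seen_actions = set()
    for _tag, thought, action in STEP_RE.findall(trajectory):
        if thought.strip():
            reward += tag_bonus       # credit a non-empty tagged reasoning step
        action = action.strip()
        if action in seen_actions:
            reward -= repeat_penalty  # discourage repeating an earlier action
        seen_actions.add(action)
    return reward


looping = (
    "<planning>find the key</planning><action>open drawer</action>"
    "<exploration>look again</exploration><action>open drawer</action>"
)
varied = (
    "<planning>find the key</planning><action>open drawer</action>"
    "<reflection>drawer was empty</reflection><action>check shelf</action>"
)
print(process_reward(looping, success=False), process_reward(varied, success=False))
```

Even with the same failed outcome, the trajectory that repeats an action scores lower than the one that tries something new, which is the kind of signal an outcome-only reward cannot provide.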
Two findings reframe the whole question. First, RL training isn't uniform. It moves through two phases: an early one where getting execution correct drives learning, and a later one where strategic planning becomes the bottleneck. Concentrating optimization on planning tokens is where the real late gains come from (see "Does RL training follow a predictable two-phase learning sequence?"). Second, if RL mostly elicits rather than creates, the leverage may lie earlier: treating chain-of-thought as an exploratory action *during pretraining*, rewarded by how much it improves prediction, lifts reasoning benchmarks by 19%, planting the capability sooner so later RL has something real to surface (see "Can chain-of-thought reasoning be learned during pretraining itself?"). The quiet takeaway across all of this: if you want RL to produce true reasoning gains, the decisive choices are made before RL even starts, in what pretraining left behind and in what your reward actually measures.
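One way to read "rewarded by how much it improves prediction" is as a per-token log-likelihood gain. The notation below is an assumption for illustration: $x_t$ is the next token, $x_{<t}$ the preceding context, $c_t$ a sampled chain-of-thought, $p_\theta$ the model predicting with the thought in context, and $p_{\text{ref}}$ a baseline predictor that sees no thought. The source notes only state that log-likelihood improvement serves as a verifier-free reward; the exact form of the baseline is not specified here.

```latex
r_t \;=\; \log p_\theta\!\left(x_t \mid x_{<t},\, c_t\right) \;-\; \log p_{\text{ref}}\!\left(x_t \mid x_{<t}\right)
```

Under this reading, $r_t > 0$ only when the thought makes the observed next token more likely than it would have been without it, so no external verifier or labeled answer is needed.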
Sources (12 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.
Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.
Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.
RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.