INQUIRING LINE

When does reinforcement learning actually produce true reasoning gains in models?

This explores the conditions under which RL genuinely extends a model's reasoning ability — versus just making it better at finding answers it already knew — and what the corpus says separates the two.


The corpus is unusually consistent on the default case: most of the time, RL doesn't teach new reasoning at all. Pass@k analysis shows that base models outperform their RL-trained versions at large k, that is, when each model is allowed many sampled attempts per problem. RL therefore narrows the model toward solutions already present in its distribution rather than discovering new ones (see "Does RLVR actually expand what models can reason about?"). Reinforcement learning with verifiable rewards (RLVR) works more like a catalyst that surfaces existing capability than a teacher that builds it (see "How does RL training reshape reasoning and what gets lost?" and "What does reward learning actually do to model reasoning?"). The most striking evidence: a single training example can be enough to 'activate' the behavior, and for models with appropriate pretraining even spurious or random rewards work nearly as well as correct ones, which only makes sense if the reasoning was already there waiting to be elicited (see "Do base models already contain hidden reasoning ability?").
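To make the pass@k argument concrete, here is a minimal sketch of the standard unbiased pass@k estimator (the probability that at least one of k samples, drawn from n generated samples of which c are correct, solves the problem). The per-problem counts below are invented purely for illustration and are not taken from any of the cited work; they are chosen to show how an RL-tuned model can lead at k=1 yet trail the base model at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n generated samples (c of them correct), is correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical correct-sample counts per problem, 256 samples each.
N_SAMPLES = 256
base_correct = [3, 1, 0, 40, 2]    # base model: rarely right, but on more problems
rl_correct = [0, 0, 0, 250, 120]   # RL model: very reliable on fewer problems

for k in (1, 128):
    print(k,
          round(mean_pass_at_k(base_correct, N_SAMPLES, k), 3),
          round(mean_pass_at_k(rl_correct, N_SAMPLES, k), 3))
```

With these made-up counts the RL model wins at k=1 and the base model wins at k=128, which is the crossover pattern the pass@k studies describe.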

So what flips RL from refinement into real gain? One controlled study gives the sharpest answer: RL produces true capability gains only when two conditions hold together. Pretraining has to have already planted the reasoning primitives, and the RL training data has to target tasks right at the edge of what the model can currently do. Miss either, and RL just re-weights sampling (see "When does RL actually extend reasoning beyond pretraining?"). Put differently, RL is a deployment optimizer, not a capability creator: it teaches the model *when* to fire its reasoning machinery, not *how* to reason. A hybrid model that combined base-model reasoning with steering from a thinking model recovered 91% of the performance gains using just 12% of the tokens, which is exactly what you'd expect if RL's job is timing and efficiency rather than new skill (see "Does RL teach reasoning or just when to use it?").
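One way to picture the "edge of competence" condition is as a filter on RL training tasks by the model's measured pass rate. The sketch below is only an illustration: the function name, the task ids, and the 0.05 and 0.5 thresholds are assumptions made here, not values or procedures reported by the study.

```python
def select_edge_tasks(pass_rates, low=0.05, high=0.5):
    """Keep tasks the model solves sometimes but not reliably.

    pass_rates maps a task id to the fraction of sampled attempts that were
    correct. The thresholds are illustrative assumptions: tasks below `low`
    give almost no reward signal, tasks above `high` are already mastered.
    """
    return [task for task, rate in pass_rates.items() if low <= rate <= high]


# Hypothetical pass rates measured on the current model before RL.
measured = {"easy-arith": 0.97, "two-step-proof": 0.22, "olympiad-geo": 0.0}
print(select_edge_tasks(measured))
```

In this toy setup only the middle task survives the filter; the other two would either rubber-stamp an existing skill or never produce a reward.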

There's a dissenting thread worth weighing against this consensus. Some work argues that with simple accuracy rewards alone, sophisticated domain reasoning can *emerge*: medical systems and models like o3 develop complex problem-solving from difficult problems without any chain-of-thought distillation from a teacher (see "Can simple rewards alone teach complex domain reasoning?"). The likely reconciliation is the 'headroom' condition from the controlled study, namely that training tasks sit at the edge of the model's competence: emergence happens when the difficulty of the problems keeps pushing the model past comfortable territory, so the reward is doing real work rather than rubber-stamping easy wins.

The more interesting frontier is changing *what* the reward measures. Outcome-only rewards leave a lot on the table. Rewarding the reasoning process itself, by tagging planning, exploration, and reflection steps and scoring them programmatically, cuts repetitive actions by 31% relative to outcome-only methods and generalizes better than supervised fine-tuning, or SFT (see "Can RL agents learn to reason better, not just succeed?"). Using the model's own answer confidence as the reward signal strengthens step-by-step reasoning while reversing the calibration damage that RLHF usually causes (see "Can model confidence work as a reward signal for reasoning?"). And rewarding explanation quality, not just token-level correctness, lets RL embed domain knowledge more durably than SFT (see "Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?").
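As a rough illustration of the confidence-as-reward idea, the sketch below ranks sampled reasoning traces by the model's confidence in its final answer span and turns the extremes into a synthetic preference pair. Everything here is an assumption made for illustration: the function names, the choice of mean token probability as the confidence score, and the best-versus-worst pairing are not taken from the RLSF work itself.

```python
import math


def answer_confidence(answer_token_logprobs):
    """Mean token probability over the answer span (one simple, assumed choice)."""
    probs = [math.exp(lp) for lp in answer_token_logprobs]
    return sum(probs) / len(probs)


def preference_pair(traces):
    """Build a synthetic (chosen, rejected) pair from sampled traces.

    traces: list of (trace_text, answer_token_logprobs) tuples.
    The most confident trace is 'chosen', the least confident is 'rejected'.
    """
    if len(traces) < 2:
        raise ValueError("need at least two traces to form a preference")
    ranked = sorted(traces, key=lambda t: answer_confidence(t[1]), reverse=True)
    return ranked[0][0], ranked[-1][0]


# Hypothetical traces with log-probabilities of their answer tokens.
sampled = [
    ("trace A", [math.log(0.9), math.log(0.8)]),
    ("trace B", [math.log(0.3)]),
    ("trace C", [math.log(0.6)]),
]
print(preference_pair(sampled))
```

No human label or external verifier appears anywhere in the loop, which is the property the note highlights.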

Two findings reframe the whole question. First, RL training isn't uniform. It moves through two phases: an early one where getting execution correct drives learning, and a later one where strategic planning becomes the bottleneck; concentrating optimization on planning tokens is where the real late gains come from (see "Does RL training follow a predictable two-phase learning sequence?"). Second, if RL mostly elicits rather than creates, the leverage may lie earlier: treating chain-of-thought as an exploratory action *during pretraining*, rewarded by how much it improves prediction, lifts reasoning benchmarks by 19%, planting the capability sooner so later RL has something real to surface (see "Can chain-of-thought reasoning be learned during pretraining itself?"). The quiet takeaway across all of this: if you want RL to produce true reasoning gains, the decisive choices are made before RL even starts, in what pretraining left behind and in what your reward actually measures.
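The pretraining reward described above, "how much it improves prediction", can be read as a log-likelihood difference: score the observed text once with the sampled chain-of-thought in context and once without it, then subtract. The sketch below shows only that arithmetic. The function name and the example numbers are hypothetical, and how the two sets of log-probabilities are produced is left out.

```python
def information_gain_rewards(logp_with_cot, logp_without_cot):
    """Per-position reward for a sampled chain-of-thought.

    Each input is a list of log-probabilities the model assigned to the
    observed tokens, with and without the thought in context. A positive
    reward means the thought made the observed text more likely.
    """
    if len(logp_with_cot) != len(logp_without_cot):
        raise ValueError("both lists must cover the same positions")
    return [w - wo for w, wo in zip(logp_with_cot, logp_without_cot)]


# Hypothetical log-probabilities for two positions.
print(information_gain_rewards([-1.0, -2.0], [-1.5, -1.0]))
```

Because the reward needs only the model's own likelihoods, it works on ordinary pretraining text with no verifier, which is what lets the capability be planted before RL.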


Sources (12 notes)

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

When does RL actually extend reasoning beyond pretraining?

A controlled synthetic framework shows RL produces true capability gains only when pretraining established reasoning primitives and RL data targets tasks at the boundary of the model's competence. Without these conditions, RL refines sampling rather than extending capability.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Can simple rewards alone teach complex domain reasoning?

Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can reinforcement learning embed domain knowledge more effectively than supervised fine-tuning?

RLAG rewards both answer accuracy and explanation rationality by cycling between augmented and unaugmented generation, progressively internalizing coherent knowledge structures. This outperforms SFT because it prioritizes reasoning quality over token-level correctness.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher stress-testing claims about when reinforcement learning produces genuine reasoning gains in LLMs. The question: does RL expand what models can reason about, or does it mostly surface reasoning already latent in the base model?

What a curated library found — and when (findings span April 2025–December 2025, treat as dated claims):
• Pass@k analysis shows base models outperform RL-trained versions when sampling multiple answers, suggesting RL narrows rather than expands reasoning capability (arXiv:2504.13837, ~2025-04)
• RL produces true gains only when pretraining leaves reasoning primitives *and* RL targets edge-of-capability tasks; otherwise it's just re-weighting the existing distribution (arXiv:2510.07364, ~2025-10)
• Rewarding reasoning process itself (planning, exploration steps) cuts wasteful repetition by 31% and generalizes better than outcome-only rewards (arXiv:2507.22844, ~2025-07)
• RL training exhibits two phases: early procedural execution, then strategic planning; concentrating optimization on planning tokens unlocks late gains (arXiv:2512.07783, ~2025-12)
• Chain-of-thought as pretraining exploratory action (rewarded by information gain) lifts reasoning benchmarks 19% and pre-plants capability for later RL (arXiv:2510.01265, ~2025-09)

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (April 2025): Does RL really incentivize reasoning capacity beyond the base?
• arXiv:2510.07364 (October 2025): Base Models Know How to Reason, Thinking Models Learn When
• arXiv:2507.22844 (July 2025): RLVMR—verifiable meta-reasoning rewards
• arXiv:2512.07783 (December 2025): Interplay of pre-training, mid-training, RL

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, determine whether newer models (o3, o4, other frontier models), better reward design (process-based, confidence-as-reward), orchestration (multi-agent, hierarchical RL), or curriculum innovation have since RELAXED the "RL is elicitation not creation" boundary. Separate the durable insight (RL is an optimizer of timing, not capability genesis) from perishable limits (e.g., does emergent reasoning on hard problems truly contradict the headroom thesis, or affirm it?). Cite what resolved any constraint; flag what still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—particularly any claiming RL *does* create novel reasoning, or showing that outcome rewards alone suffice for domain reasoning emergence.
(3) Propose 2 research questions that *assume* the regime may have shifted: e.g., does hierarchical RL with meta-reasoning rewards escape the headroom bottleneck? Can curriculum RL during *pretraining* (not post-training) push the frontier of what counts as planted capability?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
