INQUIRING LINE

Does RL primarily teach when to use reasoning or how to reason?

This explores a genuine fault line in the corpus — whether reinforcement learning mostly tunes the *timing* of reasoning a model already knows how to do, or whether it actually builds new reasoning ability.


The collection is split on this question, which is the interesting part. The dominant cluster argues for "when." One line of work frames RL post-training as a deployment optimizer: pre-training installs the reasoning capability, and RL learns to fire it efficiently. The striking evidence is a hybrid model that recovered 91% of the performance gains using only 12% of the tokens simply by routing, that is, by steering *when* the thinking model engages rather than teaching it anything new (Does RL teach reasoning or just when to use it?; Does RL post-training create reasoning or just deploy it?). A complementary finding shows that reward learning mostly raises *sampling efficiency* within the base model's existing boundaries: a single training example can suffice to activate a strategy, and even spurious rewards work nearly as well as correct ones, which only makes sense if the skill was already latent (What does reward learning actually do to model reasoning?). Push this further and even the optimizer choice stops mattering: PPO, Expert Iteration, and RC-RL perform comparably because the pretrained prior bounds what exploration can reach. RL is selection, not discovery (Does the choice of RL algorithm actually matter for reasoning?).

But the corpus doesn't let "when" win cleanly. Prolonged RL, trained with KL control, policy resetting, and tasks outside math where base models lack established patterns, produces models that beat the base at *every* pass@k level, which is the signature of genuinely expanded capability rather than reshuffled sampling (Can reinforcement learning discover reasoning strategies base models cannot?). So the answer may hinge on the domain: where the base model already has patterns, RL optimizes deployment; where it doesn't, RL can find new ones.
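To make the pass@k argument concrete, here is a minimal sketch using the unbiased pass@k estimator commonly used in sampling-based evaluation (1 - C(n-c, k) / C(n, k) for n samples with c correct). Whether the cited work uses exactly this estimator is an assumption, and the per-problem counts below are hypothetical numbers chosen only to contrast a "sharpened" model with an "expanded" one.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct answer.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

def beats_at_every_k(candidate, base, n, ks):
    """True if the candidate's mean pass@k is strictly higher at every k in ks."""
    return all(mean_pass_at_k(candidate, n, k) > mean_pass_at_k(base, n, k) for k in ks)

# Hypothetical correct counts out of N samples on four problems (illustration only).
N = 16
KS = [1, 2, 4, 8, 16]
base = [4, 2, 1, 0]         # occasionally solves three problems, never the fourth
sharpened = [16, 12, 0, 0]  # far better pass@1, but loses a problem the base could solve
expanded = [16, 12, 6, 2]   # better everywhere, including the problem the base never solves

print(beats_at_every_k(sharpened, base, N, KS))  # False: the base wins at k=16
print(beats_at_every_k(expanded, base, N, KS))   # True
```

In this toy setup the sharpened model wins at small k and loses at large k, which is the reshuffled-sampling pattern; only the expanded model wins at every k.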

The more useful reframe is that "how to reason" isn't one thing. RL can improve reasoning by *removing* rather than adding: pruning low-reward trajectories that invoke wrong domain facts produced a +12.4-point gain in medical reasoning by suppressing bad knowledge rather than teaching new knowledge (Does RL improve domain reasoning by adding knowledge or removing it?). And it can teach genuinely new *process* skill when you reward the process directly: structured meta-reasoning tags (planning, exploration, reflection, monitoring) cut repetitive actions by 31% versus outcome-only rewards (Can RL agents learn to reason better, not just succeed?).

What resolves the tension is timing inside a single training run. Across eight models, RL follows a two-phase arc: first it consolidates execution correctness (the "how" of getting steps right), then the bottleneck shifts to strategic planning (*when* and *whether* to explore), with planning-token entropy rising while execution entropy stabilizes (Does RL training follow a predictable two-phase learning sequence?). That's why curricula that imitate first and explore second beat either alone: the imitation phase produces reasonable rollouts, so the outcome reward in the RL phase becomes informative (Does sequencing imitation then exploration training improve reasoning?).
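The entropy signal behind the two-phase arc can be sketched as follows. This is only an illustration: it assumes each generated token already carries a role label ("planning" or "execution") from some external tagging step that the notes do not specify, and the probability distributions are hypothetical.

```python
import math
from collections import defaultdict

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_role(steps):
    """Average token entropy per role for a list of (role, probs) pairs.

    The role labels are an assumption of this sketch, not part of the source.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for role, probs in steps:
        totals[role] += token_entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}

# Hypothetical distributions over a 4-token vocabulary at two checkpoints.
early = [
    ("planning", [0.70, 0.10, 0.10, 0.10]),
    ("execution", [0.90, 0.05, 0.03, 0.02]),
]
late = [
    ("planning", [0.25, 0.25, 0.25, 0.25]),   # planning choices become more exploratory
    ("execution", [0.90, 0.04, 0.03, 0.03]),  # execution stays roughly where it was
]

for name, checkpoint in (("early", early), ("late", late)):
    print(name, mean_entropy_by_role(checkpoint))
```

Tracking the two averages across checkpoints is one way to look for the shift described above: planning entropy climbing while execution entropy stays flat.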

A less obvious point: the "how" you'd expect RL to teach may largely come from *pre-training* exposure to procedural documents, meaning broad, transferable reasoning patterns absorbed from diverse sources, as opposed to the narrow memorization behind factual recall (Does procedural knowledge drive reasoning more than factual retrieval?). If that's right, the whole framing tilts toward "when": the procedural how is laid down early, and RL's job is to decide when to use it, except in the frontier domains where the base model never saw the pattern at all. Scale also matters for whether "how" is even real: under RL, models smaller than 7B can reach accuracy comparable to larger ones through shortcut learning that lacks any interpretable reasoning trace, so a model can look like it learned *how* while having only learned *when to guess* (Does reinforcement learning on theory of mind collapse with model scale?).


Sources (11 notes)

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does the choice of RL algorithm actually matter for reasoning?

Expert Iteration, PPO, and RC-RL perform comparably on reasoning because exploration is constrained by the pretrained distribution, not the optimizer. RL functions as selection, not discovery—the prior contains most solutions the algorithm will find.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Does RL improve domain reasoning by adding knowledge or removing it?

RL enhances medical reasoning by suppressing incorrect domain knowledge during reasoning, not by expanding what models know. Evidence shows RL achieves a +12.4-point knowledge improvement by removing low-reward reasoning trajectories that invoke wrong facts.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating whether RL post-training teaches LLMs *when* to reason or *how* to reason. The question remains open, but the evidence landscape may have shifted.

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Dec 2025. Key constraints that may be perishable:
- RL primarily optimizes *deployment* of pre-existing reasoning; a hybrid model recovered 91% of the gains using only 12% of the tokens via routing, suggesting RL learns *when*, not *how* (2024–25).
- Reward learning raises sampling efficiency within the base model's existing boundaries; optimizer choice (PPO, Expert Iteration, RC-RL) matters little because the pretrained prior bounds exploration (2024–25).
- In math and other established-pattern domains, RL reshuffles latent capability; in domains where base models lack established patterns (non-mathematical tasks), prolonged RL with KL control and policy resetting may discover genuinely novel reasoning paths that beat the base model at all pass@k levels (2024–25).
- RL training exhibits a two-phase arc: first consolidating execution correctness (*how*), then shifting to strategic planning (*when*); imitation-then-RL curriculum outperforms either alone (2025).
- Reasoning quality under RL is scale-dependent: models smaller than 7B reach comparable accuracy via shortcut learning without interpretable reasoning traces, raising the question of whether "how" was learned at all (2025).

Anchor papers (verify; mind their dates):
- arXiv:2403.04642 *Teaching Large Language Models to Reason with Reinforcement Learning* (2024-03)
- arXiv:2505.24864 *ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries* (2025-05)
- arXiv:2507.22844 *RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards* (2025-07)
- arXiv:2512.07783 *On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models* (2025-12)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the "when vs. how" split: has the two-phase dynamic (consolidation then planning) held across newer model scales, domains, or RL algorithms? Can you ground whether the phase boundary is real or an artifact of hyperparameter choice? Test whether the 91% recovery via routing still holds under stronger baselines or longer RL horizons. Most critically: does the claim that "procedural knowledge comes from pretraining, not RL" persist when you vary pretraining data diversity or RL reward structure?

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look especially for papers claiming RL discovers *genuinely novel process skills* (not routing), or showing that small models under prolonged RL do develop interpretable reasoning chains, or demonstrating that the two-phase arc breaks down under curriculum or reward design that forces *how*-learning from the start.

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If the boundary between "when" and "how" is blurred by domain and scale, what experiment isolates the point at which RL *must* teach new reasoning rather than routing? (b) If pretraining installs most procedural knowledge, what is the minimal RL signal required to teach *genuinely new* process reasoning, and does it differ from the signal needed to optimize deployment?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
