INQUIRING LINE

Does RL teach models when to use reasoning or how to reason?

This explores whether reinforcement learning builds new reasoning ability in a model or mainly teaches it when to deploy reasoning it already has — and where the corpus splits on that question.


The corpus leans hard toward the second answer, that RL mainly teaches a model when to deploy reasoning it already has, but not unanimously, and the disagreement is the interesting part.

The dominant finding is that RL teaches *when*, not *how*. Base models appear to already carry reasoning strategies in latent form, and RL post-training optimizes when to fire them rather than creating them [Does RL post-training create reasoning or just deploy it?; Does RL teach reasoning or just when to use it?]. The striking evidence: a hybrid model that borrows reasoning from the base model and lets a thinking model decide only *which* tokens to route recovered 91% of the performance gains using just 12% of the tokens, implying RL is acting as a deployment optimizer, not a capability creator. Mechanistic work backs this up: five independent techniques (RL steering, critique fine-tuning, decoding tweaks, SAE feature steering, RLVR) all elicit reasoning already sitting in base-model activations, suggesting the bottleneck is elicitation, not acquisition [Do base models already contain hidden reasoning ability?].
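To make the 91% and 12% figures concrete, here is a minimal sketch of how a "share of gains recovered" and a "share of tokens routed" could be computed. The notes do not say how the original work defines either quantity, so the formulas, the function names `gap_recovered` and `token_share`, and every number below are illustrative assumptions, not the paper's method or data.

```python
def gap_recovered(base: float, hybrid: float, thinking: float) -> float:
    """Fraction of the base-to-thinking accuracy gap that the hybrid closes.

    Assumed definition: (hybrid - base) / (thinking - base).
    """
    if thinking == base:
        raise ValueError("no gap between base and thinking model")
    return (hybrid - base) / (thinking - base)


def token_share(routed_tokens: int, total_tokens: int) -> float:
    """Fraction of generated tokens that the thinking model intervened on."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return routed_tokens / total_tokens


# Hypothetical accuracies (percent), chosen only to reproduce the headline ratio.
BASE_ACC, THINKING_ACC, HYBRID_ACC = 40.0, 80.0, 76.4

recovered = gap_recovered(BASE_ACC, HYBRID_ACC, THINKING_ACC)
share = token_share(routed_tokens=120, total_tokens=1000)

print(f"gap recovered: {recovered:.2f}, tokens routed: {share:.2f}")
```

Under this reading, a high recovered fraction paired with a low token share is what supports the "deployment optimizer" interpretation: most of the benefit comes from a small number of routing decisions.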

The RLVR literature sharpens the same point from a different angle. Reward learning seems to activate pretraining strategies rather than teach new ones: a single training example can be enough to trigger them, and even spurious rewards work nearly as well as correct ones for a well-pretrained model [What does reward learning actually do to model reasoning?]. Pass@k analysis is the clincher here: base models actually *beat* RLVR models at high k, meaning RL narrows sampling toward solutions already in the base distribution rather than expanding what's solvable [Does RLVR actually expand what models can reason about?]. By that account, RL is teaching neither when nor how so much as *which answer to commit to faster*.
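The pass@k argument is easier to follow with numbers. The sketch below uses the widely used unbiased pass@k estimator (1 minus the probability that k samples drawn without replacement from n are all incorrect). The per-problem correct counts are hypothetical, picked only to show how a model can win at k = 1 and lose at large k; they are not results from any cited paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if k > n:
        raise ValueError("k must not exceed n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 256  # samples drawn per problem

# Hypothetical correct counts per problem (out of N samples).
BASE_COUNTS = [26, 26, 26, 2]   # base: rarely right, but never at zero
RL_COUNTS = [205, 205, 205, 0]  # RL-tuned: usually right, but one problem lost

for k in (1, 16, 256):
    base_score = mean_pass_at_k(BASE_COUNTS, N, k)
    rl_score = mean_pass_at_k(RL_COUNTS, N, k)
    print(f"k={k:>3}  base={base_score:.3f}  rl={rl_score:.3f}")
```

In this toy setup the RL-tuned model leads at k = 1 because it concentrates probability on answers it already finds, while the base model overtakes it at k = 256 because it still has non-zero mass on the fourth problem. That crossover is the pattern the notes describe as narrowing rather than expanding.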

But the corpus does not let the "when, not how" story win cleanly. Prolonged RL on diverse, non-mathematical tasks, with KL control and policy resetting, produced models that outperform the base across *all* pass@k levels, which is exactly the signature of genuinely expanded capability, not just better sampling [Can reinforcement learning discover reasoning strategies base models cannot?]. The reconciliation may be domain-dependent: RL re-routes existing skill where the base model already has established patterns (math), but can discover new strategy where it doesn't. There's also a third framing the question doesn't anticipate: RL teaching *how to reason about reasoning*. Process rewards on metacognitive tags (planning, exploration, reflection) cut repetitive actions by 31% and generalize better, which is closer to shaping the reasoning process itself than to timing it [Can RL agents learn to reason better, not just succeed?].
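For readers unfamiliar with "KL control", the standard KL-regularized RL objective is sketched below. The notes do not give the exact objective used in the prolonged-RL work, so this generic form is an assumption. The reading of "policy resetting" as periodically refreshing the reference policy, and the reset interval $T$, are assumptions too.

```latex
% Generic KL-regularized objective (assumed form, not taken from the source):
\max_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D}}\Big[
    \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[ r(x, y) \big]
    \;-\; \beta\, D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \Big]

% One possible reading of "policy resetting" (assumption):
% every T training steps, refresh the reference policy,
\pi_{\mathrm{ref}} \leftarrow \pi_\theta
```

Here $r$ is the task reward, $\beta$ sets the strength of the penalty, and $\pi_{\mathrm{ref}}$ is the policy that $\pi_\theta$ is kept close to. If the reference is never refreshed, the penalty ties the policy to the base distribution, which is one intuition for why resetting might matter for moving past it.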

Worth pulling on if you go further: the whole debate may rest on where reasoning comes from in the first place. Analysis of five million pretraining documents found that reasoning generalization is driven by broad, transferable *procedural* knowledge, not the narrow fact-memorization behind recall [Does procedural knowledge drive reasoning more than factual retrieval?]. If the procedures are laid down in pretraining, then "RL teaches when, not how" is almost a corollary. And scale matters for whether RL produces real reasoning at all: on theory-of-mind tasks, larger models develop genuine transferable belief-tracking under RL while smaller ones hit the same accuracy through shortcut learning with no interpretable trace, a reminder that matching accuracy can hide whether any reasoning was learned [Does reinforcement learning on theory of mind collapse with model scale?].


Sources (9 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Does RL teach reasoning or just when to use it?

Pre-training acquires reasoning capability; RL teaches efficient deployment. A hybrid model combining base reasoning with thinking model steering recovered 91% of performance gains using only 12% of tokens, suggesting RL acts as a deployment optimizer rather than a capability creator.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can reinforcement learning discover reasoning strategies base models cannot?

RL-trained models outperform base models across all pass@k levels when trained with KL control, policy resetting, and non-mathematical tasks. This shows RL can expand capability boundaries, not just optimize sampling efficiency, especially on domains where base models lack established patterns.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Does reinforcement learning on theory of mind collapse with model scale?

7B models develop explicit, transferable belief-tracking under RL, while smaller models achieve comparable accuracy through shortcut learning that lacks interpretable reasoning traces. The mismatch between accuracy and reasoning quality is invisible without inspecting step-by-step outputs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking whether RL in LLMs builds new reasoning *capacity* or optimizes *deployment* of latent reasoning. The question remains open despite recent work claiming settlement.

What a curated library found — and when (dated claims, not current truth): Findings span Nov 2024–Dec 2025.
• RL teaches *when* to use reasoning, not *how* to reason: hybrid routing models recover 91% of RL gains using only 12% of thinking tokens, suggesting RL is a deployment optimizer (2025-10).
• Base models already possess latent reasoning; five independent elicitation techniques (RL steering, critique fine-tuning, decoding changes, SAE feature steering, RLVR) all elicit pre-existing strategies without expanding capability boundaries (2025-04, 2025-07).
• Pass@k analysis: base models outperform RLVR at high k, implying RL narrows sampling toward existing solutions rather than expanding what's solvable (2025-07).
• *Contradiction*: prolonged RL on diverse, non-math tasks with KL control produced genuine capability gains across all pass@k levels, matching the signature of expanded reasoning (2025-05).
• Process rewards on metacognitive tags (planning, reflection) cut repetitive actions by 31% and generalize better, suggesting RL can shape *how* reasoning unfolds, not just when (2025-08).

Anchor papers (verify; mind their dates):
• arXiv:2510.07364 (Oct 2025) — "Base Models Know How to Reason, Thinking Models Learn When"
• arXiv:2505.24864 (May 2025) — ProRL on prolonged RL and genuine reasoning expansion
• arXiv:2507.14843 (Jul 2025) — "The Invisible Leash" on RLVR capability limits
• arXiv:2512.07783 (Dec 2025) — Interplay of pretraining, mid-training, RL on reasoning

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the "when not how" claim, interrogate whether newer model scales (o1-level reasoning models), longer-horizon RL, process-reward training on metacognitive tasks, or multi-agent orchestration have since *blurred or dissolved* the boundary between deployment optimization and genuine capability discovery. Does the distinction hold at current frontier scales? Cite what has shifted it, and where it still appears durable.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months.** The prolonged-RL result (2025-05) and metacognitive-rewards result (2025-08) both appear to contradict the "RL teaches only when" consensus. Hunt for January–March 2026 papers that either reconcile these threads or tip the balance decisively one way.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** (a) Has the math/non-math dichotomy (where RL teaches timing on math but discovers strategy on other tasks) held under larger models and longer RL horizons, or has it collapsed? (b) Can we operationalize a test that distinguishes *eliciting latent reasoning* from *synthesizing novel reasoning* without relying on pass@k proxy measures?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
