Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? This matters for understanding where reasoning truly emerges.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The standard account of thinking models like DeepSeek R1 and OpenAI o1 attributes their reasoning gains to RL post-training: in this story, RL teaches models how to reason. The "Base Models Know How to Reason" paper inverts this: pre-training is when reasoning capability is acquired; RL teaches models when to deploy it.

The evidence is direct. A hybrid model that combines a base model's reasoning capabilities with a thinking model's deployment decisions — without any weight updates — recovers up to 91% of the performance gap between base and thinking models while steering only 12% of tokens. The steering uses activation-space vectors: directions in the base model that, when added at the right moments, induce reasoning behaviors like backtracking, uncertainty estimation, and subgoal-setting. The thinking model acts as a controller, deciding which steering vectors to activate and when.
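A minimal sketch of the two mechanics in this paragraph, under loud assumptions: hidden states are plain Python lists, the steering vectors and the controller's decisions are hand-written placeholders, and none of the names (`steer`, `gap_recovered`, `STEERING_VECTORS`) come from the paper. It only illustrates adding a direction to the base model's activations at the token positions a controller selects, and measuring how much of the base-to-thinking gap a hybrid closes.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Hypothetical steering directions. In practice these would be extracted
# from the base model's activation space, not written by hand.
STEERING_VECTORS: Dict[str, Vector] = {
    "backtracking": [0.5, 0.0, -0.5],
    "subgoal_setting": [0.0, 1.0, 0.0],
}


def steer(hidden: List[Vector],
          decisions: List[Optional[str]],
          scale: float = 1.0) -> List[Vector]:
    """Add a steering vector at each position where the controller chose one.

    decisions[i] is the behavior name chosen for token i, or None to leave
    the base model's activation at that position untouched.
    """
    if len(hidden) != len(decisions):
        raise ValueError("one decision per token position is required")
    steered = []
    for h, name in zip(hidden, decisions):
        if name is None:
            steered.append(list(h))
        else:
            v = STEERING_VECTORS[name]
            steered.append([a + scale * b for a, b in zip(h, v)])
    return steered


def gap_recovered(base: float, hybrid: float, thinking: float) -> float:
    """Fraction of the base-to-thinking performance gap closed by the hybrid."""
    return (hybrid - base) / (thinking - base)


# Toy activations for four tokens; the controller steers only one of them.
hidden = [[1.0, 1.0, 1.0] for _ in range(4)]
decisions = [None, "backtracking", None, None]
steered = steer(hidden, decisions)
steered_fraction = sum(d is not None for d in decisions) / len(decisions)

# Illustrative accuracies only, not results from the paper.
recovered = gap_recovered(base=0.40, hybrid=0.70, thinking=0.80)
```

The base model's weights never change in this sketch; the only learned component would be whatever produces `decisions`, which is the role the note assigns to the thinking model.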

This reframes what RL actually does. RL doesn't inject new reasoning skills; it biases token generation toward patterns with high reward. If base models already contain the execution-level skills (which they demonstrably do: sampled sufficiently, they produce the same reasoning traces that appear in thinking-model outputs), then RL is essentially training an attention-based curriculum: produce the right reasoning at the right moment.
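A toy illustration of "biasing generation toward high-reward patterns", not the training setup of any paper discussed here. It assumes a softmax policy over three hypothetical behaviors with made-up rewards, and applies the exact expected policy-gradient update, for which the gradient of expected reward with respect to logit i is p_i (R_i - E[R]). The update only moves probability mass among options the policy already has.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_policy_gradient_step(logits: List[float],
                                  rewards: List[float],
                                  lr: float = 1.0) -> List[float]:
    """One exact (expected) policy-gradient step for a softmax policy.

    Uses d E[R] / d logit_i = p_i * (R_i - E[R]).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [z + lr * p * (r - baseline)
            for z, p, r in zip(logits, probs, rewards)]


# Hypothetical behaviors and rewards, chosen only for illustration.
BEHAVIORS = ["answer_directly", "backtrack", "set_subgoal"]
REWARDS = [0.0, 1.0, 0.5]

# The initial policy already *can* backtrack; it just rarely does.
logits = [2.0, 0.0, 0.0]
before = softmax(logits)
for _ in range(200):
    logits = expected_policy_gradient_step(logits, REWARDS)
after = softmax(logits)

for name, p0, p1 in zip(BEHAVIORS, before, after):
    print(f"{name}: {p0:.3f} -> {p1:.3f}")
```

The list of behaviors is fixed throughout: the update changes how often "backtrack" is chosen, never what the policy is able to do, which is the distinction the paragraph above draws.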

The implications are uncomfortable for the RL-is-essential narrative. Reasoning capability is largely a pre-training phenomenon. RL is a deployment optimizer, not a capability creator. This connects to Can prompt optimization teach models knowledge they lack? — the same principle operating at the training/inference boundary rather than purely at inference time.

Three RLVR findings reinforce this.

First, pass@k analysis shows RLVR models have narrower capability boundaries than base models: at high k, base models outperform all six tested RLVR algorithms. RLVR is a sampling efficiency optimizer, not a capability expander. See Does RLVR actually expand what models can reason about?.

Second, 1-shot RLVR achieves a 37-point jump (MATH500 36%→73.6%) from a single training example, with generalization continuing for 1,400 steps after the model perfectly memorizes its one example. The data is exhausted but activation continues, because the training signal triggers a phase transition in the model's output distribution. See Can a single training example unlock mathematical reasoning?.

Third, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as ground-truth rewards, but fail for Llama and OLMo. The differentiating variable is pretraining strategy, not reward signal quality. See Why do random rewards improve reasoning for some models but not others?.
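The pass@k argument can be made concrete with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The sketch below is an assumption-laden illustration: the note does not say which estimator the cited analysis used, and the per-problem correct counts are invented to show how a crossover can arise, not taken from any paper.

```python
from math import comb
from typing import List


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: List[int], n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples drawn per problem

# Invented counts of correct samples per problem (four problems each).
# Base model: rarely right, but occasionally right on every problem.
BASE_CORRECT = [5, 5, 5, 5]
# RLVR model: reliable on three problems, never right on the fourth.
RLVR_CORRECT = [60, 60, 60, 0]

for k in (1, 64):
    base = mean_pass_at_k(BASE_CORRECT, N, k)
    rlvr = mean_pass_at_k(RLVR_CORRECT, N, k)
    print(f"k={k}: base={base:.3f} rlvr={rlvr:.3f}")
```

With these invented counts the RLVR-style model wins at k=1 and the base-style model wins at k=64, which is the shape of the "sampling efficiency optimizer, not capability expander" claim: reliability improves on a subset while coverage of the full problem set shrinks.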

The practical implication for reasoning system design: targeted steering of base models may be a more efficient path to reasoning performance than full RL training, particularly for domains where RLVR reward signal is hard to define.

Inquiring lines that read this note 14

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does reinforcement learning teach reasoning or just when to reason?

How do training data properties shape reasoning capability development?

How do reasoning training methods sacrifice some thinking skills while improving others?

Do base models contain latent reasoning that training can unlock?

Related concepts in this collection 14

Can simple rewards alone teach complex domain reasoning? Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
challenges this: if base models already have the capability, RL is not an emergence engine but an activation scheduler
Does RL improve domain reasoning by adding knowledge or removing it? When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
consistent: RL shapes *which* capabilities get expressed, not their existence
Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
partially qualified: base models can close most of the gap with targeted activation, changing what "non-reasoning model" means
Can prompt optimization teach models knowledge they lack? Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
extends to training-time dynamics
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
pass@k evidence: RLVR narrows scope to reliable subset
Can a single training example unlock mathematical reasoning? Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
1-shot activation: minimal signal triggers phase transition
Why do random rewards improve reasoning for some models but not others? When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
pretraining determines RLVR effectiveness, not reward quality
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
if RL is narrowing to a deployment-timing policy rather than building capability, entropy collapse is the natural consequence: the model converges on a single activation schedule and loses the diversity of timing strategies that would sustain continued improvement
Can models learn to internalize search algorithms through training? Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.
challenges the when-not-how framing: Meta-CoT proposes that search algorithms ARE trainable as the "how" component, suggesting RL may operate at two levels — timing (when to reason) and search internalization (how to reason)
Does reinforcement learning on theory of mind collapse with model scale? When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.
adds a capacity caveat: RL teaches when-not-how only when the model has sufficient latent capability; below a scale threshold in social reasoning, RL teaches shortcuts instead of activation timing
Can models improve themselves on tasks without verifiable answers? Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
catalyst data reinforces the when-not-how thesis through a different mechanism: 1000 demonstrations teach the model to enrich its reasoning output format, not to reason; the small data requirement confirms the capability is latent and the catalyst is an activation signal for reasoning articulation, not reasoning itself
Can next-token prediction become a reasoning task with RL? Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
RPT enriches the latent capability that post-training activates: by embedding RL reasoning patterns during pretraining itself, RPT creates a richer foundation for the "when" decision that post-training teaches
Does thinking emerge when agents choose between learned sub-policies? Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
provides the formal mechanism: the thought MDP formalizes "when to activate" as sub-policy selection within a rich policy initialization; thinking is choosing which existing sub-policy to deploy, not building new capability
When does RL actually extend reasoning beyond pretraining? Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
controlled-experiment evidence for the precondition: RL activates/extends only the primitives pretraining laid down, and only with headroom + edge-of-competence data

Original note title

rl post-training teaches models when to activate reasoning mechanisms not how to execute them
