Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
The standard account of thinking models like DeepSeek R1 and OpenAI o1 attributes their reasoning gains to RL post-training: on this story, RL teaches models how to reason. The "Base Models Know How to Reason" paper inverts this. Pre-training is when reasoning capability is acquired; RL teaches models when to deploy it.
The evidence is direct. A hybrid model that combines a base model's reasoning capabilities with a thinking model's deployment decisions — without any weight updates — recovers up to 91% of the performance gap between base and thinking models while steering only 12% of tokens. The steering uses activation-space vectors: directions in the base model that, when added at the right moments, induce reasoning behaviors like backtracking, uncertainty estimation, and subgoal-setting. The thinking model acts as a controller, deciding which steering vectors to activate and when.
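The steering mechanism described above can be sketched with toy arrays. This is an illustration, not the paper's implementation: the hidden states are random stand-ins for base-model activations, the vector names, `schedule`, and `scale` are hypothetical, and `gap_recovered` assumes the natural definition of gap recovery (hybrid gain divided by the base-to-thinking gap) with made-up accuracy values.

```python
import numpy as np

def steer_hidden_states(hidden, steering_vectors, schedule, scale=1.0):
    """Add a steering vector to the activations at selected token positions.

    hidden: (num_tokens, d_model) array standing in for base-model activations.
    steering_vectors: dict mapping a behavior name to a (d_model,) direction.
    schedule: dict mapping token position to behavior name. It stands in for
        the thinking model's decision about which behavior to trigger and when.
    """
    steered = hidden.copy()
    for position, behavior in schedule.items():
        steered[position] += scale * steering_vectors[behavior]
    return steered

def gap_recovered(base_acc, hybrid_acc, thinking_acc):
    """Fraction of the base-to-thinking gap closed by the hybrid (assumed definition)."""
    return (hybrid_acc - base_acc) / (thinking_acc - base_acc)

rng = np.random.default_rng(0)
num_tokens, d_model = 25, 8
hidden = rng.normal(size=(num_tokens, d_model))

# Hypothetical directions; in practice these would be extracted from the base model.
steering_vectors = {
    "backtracking": rng.normal(size=d_model),
    "subgoal_setting": rng.normal(size=d_model),
}

# Hypothetical controller output: steer 3 of 25 positions (12% of tokens).
schedule = {3: "subgoal_setting", 11: "backtracking", 19: "backtracking"}

steered = steer_hidden_states(hidden, steering_vectors, schedule)
changed = np.any(steered != hidden, axis=1)
steered_fraction = changed.mean()

print(f"steered fraction of tokens: {steered_fraction:.2f}")
print(f"gap recovered (made-up accuracies): {gap_recovered(0.40, 0.764, 0.80):.2f}")
```

The base model's weights never appear in the update: only the activations at scheduled positions change, which is what "without any weight updates" amounts to in this toy setting.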
This reframes what RL actually does. RL doesn't inject new reasoning skills; it biases token generation toward patterns with high reward. If base models already contain the execution-level skills (and the evidence suggests they do: sampled sufficiently, they produce the same reasoning traces that appear in thinking model outputs), RL is essentially training a scheduler for those skills: produce the right reasoning at the right moment.
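One way to see why reward-driven biasing redistributes probability rather than creating new behaviors is the closed-form optimum of a KL-regularized objective, where the tuned policy is proportional to the base policy times exp(reward / beta). The note does not say which RL objective is in play, so treat this as an idealized sketch under that assumption; the four "reasoning patterns", their probabilities, and the rewards are invented for illustration.

```python
import numpy as np

def kl_regularized_optimum(base_probs, rewards, beta):
    """Optimum of E[reward] - beta * KL(pi || base): pi is proportional to base * exp(reward / beta)."""
    base_probs = np.asarray(base_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    weights = base_probs * np.exp((rewards - rewards.max()) / beta)
    return weights / weights.sum()

# Hypothetical base distribution over four reasoning patterns.
# Pattern 2 is rare but rewarded; pattern 3 is rewarded but absent from the base model.
base = np.array([0.70, 0.25, 0.05, 0.00])
rewards = np.array([0.0, 0.0, 1.0, 1.0])

tuned = kl_regularized_optimum(base, rewards, beta=0.2)

for i, (p, q) in enumerate(zip(base, tuned)):
    print(f"pattern {i}: base={p:.3f} tuned={q:.3f}")
```

In this idealization the rare rewarded pattern becomes dominant, while the pattern with zero base probability stays at zero however large its reward. Real RL training is not this closed form, so the sketch illustrates the intuition rather than proving the claim.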
The implications are uncomfortable for the RL-is-essential narrative. Reasoning capability is largely a pre-training phenomenon. RL is a deployment optimizer, not a capability creator. This connects to "Can prompt optimization teach models knowledge they lack?": the same principle, operating at the training/inference boundary rather than purely at inference time.
Three RLVR findings reinforce this:
- First, pass@k analysis shows RLVR models have narrower capability boundaries than base models: at high k, base models outperform all six tested RLVR algorithms. RLVR is a sampling efficiency optimizer, not a capability expander. See "Does RLVR actually expand what models can reason about?"
- Second, 1-shot RLVR achieves a 37.6-point jump (MATH500 36%→73.6%) from a single training example, with generalization continuing for 1,400 steps after the model perfectly memorizes its one example. The data is exhausted but activation continues, because the training signal triggers a phase transition in the model's output distribution. See "Can a single training example unlock mathematical reasoning?"
- Third, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as ground-truth rewards, but fail for Llama and OLMo. The differentiating variable is pretraining strategy, not reward signal quality. See "Why do random rewards improve reasoning for some models but not others?"
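The pass@k comparison in the first finding can be made concrete with the standard unbiased pass@k estimator (Chen et al., 2021), which computes 1 - C(n-c, k) / C(n, k) from n samples of which c are correct. Whether the cited RLVR study used exactly this estimator is an assumption here, and the per-problem correct counts below are invented to show the crossover pattern, not taken from any paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct ones."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

n = 256
# Hypothetical correct counts per problem out of n samples.
# The base model rarely succeeds but succeeds somewhere on every problem.
base_counts = [26, 13, 5, 3]
# The RLVR model is reliable on two problems and never solves the other two.
rlvr_counts = [230, 205, 0, 0]

for k in (1, 16, 256):
    base_score = mean_pass_at_k(base_counts, n, k)
    rlvr_score = mean_pass_at_k(rlvr_counts, n, k)
    print(f"k={k:>3}: base={base_score:.3f} rlvr={rlvr_score:.3f}")
```

At k=1 the hypothetical RLVR model wins on sampling efficiency; at k=256 the base model covers every problem while the RLVR model covers only half. That is the shape of the "narrower capability boundary" claim, reproduced on toy numbers.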
The practical implication for reasoning system design: targeted steering of base models may be a more efficient path to reasoning performance than full RL training, particularly in domains where an RLVR reward signal is hard to define.
Inquiring lines that use this note as a source 14
This note is a source for these synthesized inquiries.
- Can RL teach when to use reasoning versus when to respond directly?
- Can RL training teach models when to activate reasoning versus when to skip it?
- How do reasoning training methods sacrifice some thinking skills while improving others?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- Does RL training actually restore the critical thinking that reasoning models lose?
- Does RL teach models when to use reasoning or how to reason?
- Can one training example activate mathematical reasoning in RL-trained models?
- What is the distinction between teaching reasoning how versus when to activate?
- Can one training example activate mathematical reasoning without reinforcement learning?
- Does reinforcement learning teach models how to reason or when to reason?
- Does RL primarily teach when to use reasoning or how to reason?
- What does RL post-training actually teach reasoning systems?
- Can we predict when a model will develop thinking behaviors?
- When does reinforcement learning actually produce true reasoning gains in models?
Related concepts in this collection 14
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  challenges this: if base models already have the capability, RL is not an emergence engine but an activation scheduler
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  consistent: RL shapes *which* capabilities get expressed, not their existence
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  partially qualified: base models can close most of the gap with targeted activation, changing what "non-reasoning model" means
- Can prompt optimization teach models knowledge they lack?
  Explores whether sophisticated prompting techniques can inject new domain knowledge into language models, or if they're limited to activating existing training knowledge.
  extends to training-time dynamics
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  pass@k evidence: RLVR narrows scope to reliable subset
- Can a single training example unlock mathematical reasoning?
  Explores whether one example is enough to dramatically improve math problem-solving in language models, and whether learning continues after perfect memorization.
  1-shot activation: minimal signal triggers phase transition
- Why do random rewards improve reasoning for some models but not others?
  When RLVR training uses meaningless reward signals, some models gain reasoning improvements while others don't. What determines which models can benefit from optimization pressure without meaningful feedback?
  pretraining determines RLVR effectiveness, not reward quality
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  if RL is narrowing to a deployment-timing policy rather than building capability, entropy collapse is the natural consequence: the model converges on a single activation schedule and loses the diversity of timing strategies that would sustain continued improvement
- Can models learn to internalize search algorithms through training?
  Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.
  challenges the when-not-how framing: Meta-CoT proposes that search algorithms ARE trainable as the "how" component, suggesting RL may operate at two levels: timing (when to reason) and search internalization (how to reason)
- Does reinforcement learning on theory of mind collapse with model scale?
  When RL improves social reasoning, does the quality of reasoning depend on model size? The question matters because accuracy alone may hide whether models are actually thinking or just pattern-matching.
  adds a capacity caveat: RL teaches when-not-how only when the model has sufficient latent capability; below a scale threshold in social reasoning, RL teaches shortcuts instead of activation timing
- Can models improve themselves on tasks without verifiable answers?
  Most self-improvement methods require verifiable correctness signals like math or code. Can models improve on open-ended instruction tasks where right answers aren't automatically checkable? And what minimal training is needed to unlock this?
  catalyst data reinforces the when-not-how thesis through a different mechanism: 1000 demonstrations teach the model to enrich its reasoning output format, not to reason; the small data requirement confirms the capability is latent and the catalyst is an activation signal for reasoning articulation, not reasoning itself
- Can next-token prediction become a reasoning task with RL?
  Does reinforcement learning applied to next-token prediction during pretraining encourage genuine reasoning rather than surface memorization? This matters because it could unlock reasoning capability without requiring labeled data or human feedback.
  RPT enriches the latent capability that post-training activates: by embedding RL reasoning patterns during pretraining itself, RPT creates a richer foundation for the "when" decision that post-training teaches
- Does thinking emerge when agents choose between learned sub-policies?
  Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
  provides the formal mechanism: the thought MDP formalizes "when to activate" as sub-policy selection within a rich policy initialization; thinking is choosing which existing sub-policy to deploy, not building new capability
- When does RL actually extend reasoning beyond pretraining?
  Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
  controlled-experiment evidence for the precondition: RL activates/extends only the primitives pretraining laid down, and only with headroom + edge-of-competence data
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Eliciting Reasoning in Language Models with Cognitive Tools
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Base Models Know How to Reason, Thinking Models Learn When
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models
- Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- The Invisible Leash: Why RLVR May Not Escape Its Origin
Original note title
rl post-training teaches models when to activate reasoning mechanisms not how to execute them