SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

Does RL teach reasoning or just when to use it?

Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? This matters for understanding where reasoning truly emerges.

Synthesis note · 2026-02-22 · sourced from Reasoning Architectures

The standard account of thinking models like DeepSeek R1 and OpenAI o1 attributes their reasoning gains to RL post-training, a story in which RL teaches models how to reason. The "Base Models Know How to Reason" paper inverts this: reasoning capability is acquired during pre-training; RL teaches models when to deploy it.

The evidence is direct. A hybrid model that combines a base model's reasoning capabilities with a thinking model's deployment decisions — without any weight updates — recovers up to 91% of the performance gap between base and thinking models while steering only 12% of tokens. The steering uses activation-space vectors: directions in the base model that, when added at the right moments, induce reasoning behaviors like backtracking, uncertainty estimation, and subgoal-setting. The thinking model acts as a controller, deciding which steering vectors to activate and when.
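The two quantities in the paragraph above, gap recovery and selectively gated steering, can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the function names `gap_recovered` and `steer`, the `gate` list standing in for the thinking model's decisions, the scale `alpha`, and every number are hypothetical, and plain Python lists stand in for real activations.

```python
from typing import List, Sequence


def gap_recovered(base: float, hybrid: float, thinking: float) -> float:
    """Fraction of the base-to-thinking performance gap closed by the hybrid."""
    gap = thinking - base
    if gap == 0:
        raise ValueError("base and thinking scores are equal; the gap is undefined")
    return (hybrid - base) / gap


def steer(
    hidden: Sequence[Sequence[float]],
    vector: Sequence[float],
    gate: Sequence[bool],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Add alpha * vector to the hidden state at each position where gate is True.

    hidden has one row per token position; gate plays the role of the
    controller deciding when a reasoning behavior should be induced.
    """
    if len(hidden) != len(gate):
        raise ValueError("hidden and gate must cover the same token positions")
    steered = []
    for state, active in zip(hidden, gate):
        if active:
            steered.append([h + alpha * v for h, v in zip(state, vector)])
        else:
            steered.append(list(state))  # untouched positions are copied as-is
    return steered


# Hypothetical toy values: four token positions, two-dimensional activations.
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
backtrack_vector = [1.0, -1.0]       # assumed direction for one behavior
gate = [False, True, False, False]   # controller steers only the second token

steered = steer(hidden, backtrack_vector, gate, alpha=2.0)
steered_fraction = sum(gate) / len(gate)
recovered = gap_recovered(base=0.25, hybrid=0.625, thinking=0.75)
```

In this toy run only one of four positions is modified, and the made-up scores give a recovery of 0.75. The reported figures (up to 91% of the gap, 12% of tokens) are the same two ratios measured on real models.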

This reframes what RL actually does. RL doesn't inject new reasoning skills; it biases token generation toward patterns with high reward. If base models already contain the execution-level skills (which they demonstrably do: given enough samples, they produce the same reasoning traces that appear in thinking-model outputs), then RL is essentially training a timing policy: produce the right reasoning at the right moment.
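One way to see why reward-driven biasing need not add new skills is the textbook closed form for a KL-regularized objective, where the optimal policy is the base distribution reweighted by exp(reward / beta). The note does not say which RL objective the cited work uses, so the KL-regularized setting is an assumption here, and the function name, strategy labels, probabilities, and rewards below are all hypothetical.

```python
import math
from typing import Dict


def kl_regularized_optimum(
    base_probs: Dict[str, float],
    rewards: Dict[str, float],
    beta: float,
) -> Dict[str, float]:
    """Return the policy proportional to base_prob * exp(reward / beta).

    This is the maximizer of expected reward minus beta * KL(policy || base).
    Any output with zero base probability keeps zero probability.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    weights = {y: p * math.exp(rewards[y] / beta) for y, p in base_probs.items()}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical toy distribution over three response strategies.
base_probs = {
    "answer_directly": 0.7,
    "backtrack_then_answer": 0.3,
    "unseen_strategy": 0.0,  # the base model never produces this
}
rewards = {
    "answer_directly": 0.0,
    "backtrack_then_answer": 1.0,
    "unseen_strategy": 1.0,  # rewarded, but unreachable from the base model
}

tuned = kl_regularized_optimum(base_probs, rewards, beta=0.5)
```

In this toy setting the rewarded strategy that the base model already produces gains probability mass, while the equally rewarded strategy outside the base model's support stays at zero. Reweighting changes how often a behavior appears, not whether it is available.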

The implications are uncomfortable for the RL-is-essential narrative. Reasoning capability is largely a pre-training phenomenon. RL is a deployment optimizer, not a capability creator. This connects to "Can prompt optimization teach models knowledge they lack?": the same principle operating at the training/inference boundary rather than purely at inference time.

Three RLVR findings reinforce this.

First, pass@k analysis shows RLVR models have narrower capability boundaries than base models: at high k, base models outperform all six tested RLVR algorithms. RLVR is a sampling efficiency optimizer, not a capability expander. See "Does RLVR actually expand what models can reason about?"

Second, 1-shot RLVR achieves a 37.6-point jump (MATH500 36% to 73.6%) from a single training example, with generalization continuing for 1,400 steps after the model perfectly memorizes its one example. The data is exhausted but test performance keeps improving, because the training signal triggers a phase transition in the model's output distribution. See "Can a single training example unlock mathematical reasoning?"

Third, spurious rewards (random, incorrect, or format-only) improve Qwen2.5-Math nearly as much as ground-truth rewards, but fail for Llama and OLMo. The differentiating variable is pretraining strategy, not reward signal quality. See "Why do random rewards improve reasoning for some models but not others?"
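The pass@k argument can be sketched with the standard unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The note does not state which estimator the cited analysis uses, so treat this as an assumed setup. The per-problem correct counts below are invented purely to show how a crossover between a base model and an RLVR-tuned model can arise.

```python
import math
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c correct, budget k."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which makes the estimate exactly 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(n: int, correct_counts: Sequence[int], k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 256 samples on two problems.
n = 256
base_counts = [20, 2]   # base model: rarely right, but sometimes right on both
rlvr_counts = [200, 0]  # tuned model: reliable on one problem, never on the other

base_at_1 = mean_pass_at_k(n, base_counts, k=1)
rlvr_at_1 = mean_pass_at_k(n, rlvr_counts, k=1)
base_at_256 = mean_pass_at_k(n, base_counts, k=256)
rlvr_at_256 = mean_pass_at_k(n, rlvr_counts, k=256)
```

With these invented counts the tuned model wins at k = 1, but the base model wins at k = 256 because it solves both problems at least occasionally. That crossover is the pattern the first finding describes: better sampling efficiency alongside a narrower set of solvable problems.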

The practical implication for reasoning system design: targeted steering of base models may be a more efficient path to reasoning performance than full RL training, particularly for domains where RLVR reward signal is hard to define.

Original note title

rl post-training teaches models when to activate reasoning mechanisms not how to execute them