Does thinking emerge when agents choose between learned sub-policies?
Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This explores whether thinking requires new capabilities or just the right conditions to activate what's already there.
What is thinking, computationally? This paper proposes a minimal formalization: thinking is taking actions that don't directly produce reward or affect the external environment but that lead the agent to take a different, higher-reward course of action. The key construct is a "thought MDP" — a classical MDP extended with explicit thought actions and a controllable thought state.
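The thought MDP construct can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the names `AgentState`, `step`, and the toy transition and reward functions are hypothetical, and it assumes that a thought action changes only the thought state, yields zero reward, and leaves the environment state untouched.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class AgentState:
    env_state: int      # external environment state
    thought_state: int  # internal, controllable thought state


# An action is tagged as either an environment action or a thought action.
Action = Tuple[str, int]  # ("env", a) or ("think", c)


def step(
    state: AgentState,
    action: Action,
    env_transition: Callable[[int, int], int],
    env_reward: Callable[[int, int], float],
    thought_transition: Callable[[int, int], int],
) -> Tuple[AgentState, float]:
    """Advance the (assumed) thought MDP by one step."""
    kind, value = action
    if kind == "think":
        # Thought action: no reward, environment state unchanged.
        new_thought = thought_transition(state.thought_state, value)
        return AgentState(state.env_state, new_thought), 0.0
    if kind != "env":
        raise ValueError(f"unknown action kind: {kind!r}")
    # Environment action: ordinary MDP step, thought state carried over.
    reward = env_reward(state.env_state, value)
    next_env = env_transition(state.env_state, value)
    return AgentState(next_env, state.thought_state), reward


# Hypothetical toy dynamics, used only to exercise `step`.
def toy_env_transition(s: int, a: int) -> int:
    return s + a


def toy_env_reward(s: int, a: int) -> float:
    return 1.0 if s + a == 2 else 0.0


def toy_thought_transition(tau: int, c: int) -> int:
    return c  # the thought action directly sets the thought state
```

With these toy dynamics, `step(AgentState(0, 0), ("think", 1), ...)` returns the same environment state with a new thought state and a reward of zero, while an `("env", a)` action behaves like a step in a classical MDP.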
The central theoretical result concerns the conditions under which thinking emerges. Under this formalization, thinking can be viewed as selecting between a set of sub-policies already contained in the agent's policy function, and thought actions are interpretable as the agent choosing to run one or more steps of policy improvement before continuing to act. Thinking therefore doesn't require new capabilities; it requires a policy initialization rich enough to contain multiple sub-policies worth selecting between.
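A toy rollout can illustrate the sub-policy view. This is a sketch under stated assumptions, not the paper's construction: the corridor environment, the two sub-policies, the discount factor, and the convention that a thought action consumes one time step are all hypothetical. The policy is indexed by the thought state, so a single thought action switches which existing sub-policy drives behaviour.

```python
GAMMA = 0.9  # assumed discount factor
GOAL = 3     # goal position in a hypothetical corridor 0..GOAL


def go_left(pos: int) -> int:
    return -1


def go_right(pos: int) -> int:
    return +1


# Sub-policies already contained in the policy, indexed by thought state.
SUB_POLICIES = {0: go_left, 1: go_right}


def rollout(think_first: bool, horizon: int = 10) -> float:
    """Discounted return, starting at position 0 with thought state 0."""
    pos, thought, ret, discount = 0, 0, 0.0, 1.0
    for t in range(horizon):
        if think_first and t == 0:
            # Thought action: switch sub-policy; no reward, position unchanged.
            thought = 1
        else:
            move = SUB_POLICIES[thought](pos)
            pos = max(0, min(GOAL, pos + move))
            if pos == GOAL:
                ret += discount * 1.0  # reward of 1 on reaching the goal
                break
        discount *= GAMMA
    return ret


no_think_return = rollout(think_first=False)
think_return = rollout(think_first=True)
```

In this toy setting the default sub-policy never reaches the goal, so acting immediately earns nothing, while spending one step on a thought action delays the reward but makes it reachable. No new behaviour is learned by thinking; the thought action only selects the better of two sub-policies that were already present.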
This reframes DeepSeek-R1's "aha moment" and similar findings. The thinking tokens that emerge during RL training aren't building new reasoning capabilities from scratch. They're learning to select which existing sub-policy to deploy. The rich policy initialization from pre-training provides the raw material; RL provides the selection pressure.
The connection to existing notes is tight. For "Does RL teach reasoning or just when to use it?", the thought MDP provides the formal mechanism: "when to activate" is exactly "which sub-policy to select." For "Can models learn when to think versus respond quickly?", the thought MDP explains why adaptive thinking works: the model is learning a meta-policy over its own sub-policies.
The deeper philosophical implication is that thinking is not a unitary capability but a structural property that emerges when the right conditions are met: a rich enough policy space, a selection mechanism (RL), and a task structure where delayed action (thinking first) is rewarded. LLMs instantiate these conditions because pre-training provides the policy richness and RL provides the selection pressure.
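The three conditions can be sketched together in a few lines. The example below is a hypothetical illustration, not the paper's algorithm: it uses a simple epsilon-greedy, bandit-style value update (a stand-in for model-free RL) over one initial choice, "act" versus "think", with assumed toy returns. The policy space contains a useful sub-policy, the update rule supplies selection pressure, and the task rewards thinking first.

```python
import random

GAMMA = 0.9  # assumed discount factor


def episode_return(first_choice: str) -> float:
    """Assumed toy returns for the two possible first choices."""
    if first_choice == "think":
        # One step spent thinking, then the good sub-policy reaches a
        # reward of 1 three steps into the episode.
        return GAMMA ** 3
    # Acting immediately follows the default sub-policy, which earns nothing.
    return 0.0


def train(episodes: int = 500, epsilon: float = 0.2,
          alpha: float = 0.1, seed: int = 0) -> dict:
    """Epsilon-greedy value estimates for the first choice."""
    rng = random.Random(seed)
    q = {"act": 0.0, "think": 0.0}
    for _ in range(episodes):
        if rng.random() < epsilon:
            choice = rng.choice(["act", "think"])  # explore
        else:
            choice = max(q, key=q.get)  # exploit; ties go to "act"
        # Move the estimate toward the observed return.
        q[choice] += alpha * (episode_return(choice) - q[choice])
    return q


q_values = train()
```

Once exploration happens to try the thought action, its estimated value rises above that of acting immediately, and the greedy choice shifts to thinking first. The preference for thinking is produced by reward alone, given that a worthwhile sub-policy already exists to be selected.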
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries.
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- How does policy initialization with sub-policies enable emergent thinking?
- How do agents decide when to pause and reflect on their strategy?
- Why does pre-training provide the raw material for emergent thinking?
- How do thought actions represent policy improvement steps in practice?
Related concepts in this collection (4)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  formalizes: the thought MDP gives the mathematical structure for "when to activate" as sub-policy selection
- Can models learn when to think versus respond quickly?
  Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
  instantiates: the meta-policy over thinking vs. concise response is thought action selection
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  explains: latent capability = sub-policies in the policy initialization
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  connects: the planning phase is when the model learns to use thought actions effectively
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reinforcement Learning be Enough for Thinking?
- Fast, Slow, and Tool-augmented Thinking for LLMs: A Review
- Base Models Know How to Reason, Thinking Models Learn When
- Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
- SAND: Boosting LLM Agents with Self-Taught Action Deliberation
- RLP: Reinforcement as a Pretraining Objective
- Eliciting Reasoning in Language Models with Cognitive Tools
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
Original note title
thinking emerges under model-free rl when policy initialization provides sub-policies to select between