SYNTHESIS NOTE

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This note explores whether thinking requires new capabilities or just the right conditions to activate what's already there.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

What is thinking, computationally? This paper proposes a minimal formalization: thinking is taking actions that don't directly produce reward or affect the external environment but that lead the agent to take a different, higher-reward course of action. The key construct is a "thought MDP" — a classical MDP extended with explicit thought actions and a controllable thought state.
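To make the definition concrete, here is a minimal sketch of a thought MDP as a toy transition function. Everything in it is an illustrative assumption rather than the paper's construction: the one-dimensional environment, the action names, the goal position, and the choice to represent the thought state as a string.

```python
from dataclasses import dataclass

# Hypothetical toy thought MDP; all names and numbers are illustrative
# assumptions, not taken from the source paper.

@dataclass(frozen=True)
class State:
    env: int      # external environment state (position on a line)
    thought: str  # controllable thought state

ENV_ACTIONS = {"left": -1, "right": +1}
# Each thought action overwrites the thought state with a new value.
THOUGHT_ACTIONS = {"think_left": "go_left", "think_right": "go_right"}
GOAL = 3

def step(state: State, action: str) -> tuple[State, float]:
    """Return (next_state, reward) for one step of the toy thought MDP."""
    if action in THOUGHT_ACTIONS:
        # Thought action: zero reward, environment state untouched,
        # only the thought state changes.
        return State(state.env, THOUGHT_ACTIONS[action]), 0.0
    # Environment action: moves the agent and may produce reward.
    next_env = state.env + ENV_ACTIONS[action]
    reward = 1.0 if next_env == GOAL else 0.0
    return State(next_env, state.thought), reward
```

The point of the sketch is the branch structure: a thought action is an ordinary action from the agent's side, but it can only influence future reward through the thought state that later decisions condition on.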

The central theoretical result concerns the conditions under which thinking emerges. Under this formalization, thinking can be viewed as selecting between a set of sub-policies already contained in the agent's policy function. Thought actions are interpretable as the agent choosing to run one or more steps of policy improvement before continuing to act. This means thinking doesn't require new capabilities; it requires a policy initialization rich enough to contain multiple sub-policies worth selecting between.
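The sub-policy view can be illustrated with a small self-contained sketch. It assumes a toy setting that is not from the source: a one-dimensional walk, two hand-written sub-policies indexed by thought state, and a thought action that costs one time step. The choice of which thought to adopt is hard-coded here; in the setting the note describes, RL would have to learn it.

```python
# Hypothetical toy; every name and number below is an illustrative assumption.
GOAL, START, HORIZON = 3, 0, 5

# Sub-policies already contained in the policy, indexed by thought state.
# Each maps an environment state to a move of -1 or +1.
SUB_POLICIES = {
    "go_left": lambda env: -1,
    "go_right": lambda env: +1,
}

def rollout(initial_thought: str, think_first: bool) -> float:
    """Return the total reward collected within HORIZON steps."""
    env, thought, total = START, initial_thought, 0.0
    for t in range(HORIZON):
        if think_first and t == 0:
            # Thought action: uses up a step, yields no reward, leaves the
            # environment unchanged, and switches the active sub-policy.
            thought = "go_right"
            continue
        env += SUB_POLICIES[thought](env)
        if env == GOAL:
            total += 1.0
            break
    return total

no_thinking = rollout("go_left", think_first=False)  # stuck with a poor sub-policy
with_thinking = rollout("go_left", think_first=True)  # selects a better one first
```

Starting from the poor thought state, acting immediately never reaches the goal, while spending one step on the thought action does. Nothing new was learned during the episode; the better behaviour was already present in `SUB_POLICIES` and only needed to be selected.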

This reframes DeepSeek-R1's "aha moment" and similar findings. On this reading, the thinking tokens that emerge during RL training aren't building new reasoning capabilities from scratch; they reflect the model learning to select which existing sub-policy to deploy. The rich policy initialization from pre-training provides the raw material; RL provides the selection pressure.

The connection to existing insights is tight. For the question "Does RL teach reasoning or just when to use it?", the thought MDP provides the formal mechanism: "when to activate" IS "which sub-policy to select." For the question "Can models learn when to think versus respond quickly?", the thought MDP explains why this works: the model is learning a meta-policy over its own sub-policies.

The deeper philosophical implication is that thinking is not a unitary capability but a structural property that emerges when the right conditions are met: a rich enough policy space, a selection mechanism (RL), and a task structure where delayed action (thinking first) is rewarded. LLMs instantiate these conditions because pre-training provides the policy richness and RL provides the selection pressure.
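The third condition, a task structure where thinking first is rewarded, can be written as a simple inequality. This is a sketch under assumptions not confirmed by the source: a thought action $c$ yields zero reward, consumes one time step discounted by $\gamma$, leaves the environment state $s$ unchanged, and moves the thought state from $\tau$ to $\tau'$. The symbols $V^{\pi}$ and $Q^{\pi}$ are the usual value functions of the current policy $\pi$.

```latex
Q^{\pi}\big((s,\tau),\, c\big) \;=\; 0 + \gamma\, V^{\pi}(s,\tau'),
\qquad
\text{thinking is preferred when}\quad \gamma\, V^{\pi}(s,\tau') \;>\; V^{\pi}(s,\tau).
```

Under these assumptions, a thought action is worth taking only when the sub-policy it selects is better by enough to offset the one-step delay, which is one way to read the requirement that delayed action be rewarded.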

Inquiring lines that read this note (5)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do base models contain latent reasoning that training can unlock?

How does AI assistance affect human cognitive development and reasoning autonomy?

How do agents decide when to pause and reflect on their strategy?

How does latent reasoning compare to verbalized chain-of-thought?

How do thought actions represent policy improvement steps in practice?

Related concepts in this collection (4)

Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
formalizes: the thought MDP gives the mathematical structure for "when to activate" as sub-policy selection

Can models learn when to think versus respond quickly?
Explores whether a single language model can adaptively choose between extended reasoning and direct responses based on task difficulty. This matters because it could make inference more efficient by allocating compute only when needed.
instantiates: the meta-policy over thinking vs. concise response IS thought action selection

Do base models already contain hidden reasoning ability?
Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
explains: latent capability = sub-policies in the policy initialization

Does RL training follow a predictable two-phase learning sequence?
This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
connects: the planning phase is when the model learns to use thought actions effectively
