SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Model Architecture and Internals · Agentic Systems and Tool Use

Does thinking emerge when agents choose between learned sub-policies?

Can we formally understand thinking as the selection of pre-existing sub-policies during reinforcement learning? This note explores whether thinking requires new capabilities or only the right conditions to activate what is already there.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning
How should we allocate compute budget at inference time? · What kind of thing is an LLM really?

What is thinking, computationally? This paper proposes a minimal formalization: thinking is taking actions that don't directly produce reward or affect the external environment but that lead the agent to take a different, higher-reward course of action. The key construct is a "thought MDP" — a classical MDP extended with explicit thought actions and a controllable thought state.
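The definition above can be made concrete with a toy example. The following is a minimal sketch, not the paper's code: every name (`AgentState`, `THINK`, `step`, `act`, `rollout`) is hypothetical, the task is a made-up one-step choice between "left" and "right", and it assumes that a thought action consumes one discounted time step.

```python
from dataclasses import dataclass

# All names here are hypothetical and chosen for illustration only.
THINK = "think"  # thought action: changes only the thought state


@dataclass(frozen=True)
class AgentState:
    env_state: int      # external environment state
    thought_state: int  # internal, controllable thought state


def step(state, action):
    """One transition of a toy thought MDP: returns (next_state, reward, done)."""
    if action == THINK:
        # Thought actions give no reward and leave the environment untouched.
        return AgentState(state.env_state, 1), 0.0, False
    # Environment actions end this one-step task; only "right" is rewarded.
    reward = 1.0 if action == "right" else 0.0
    return AgentState(state.env_state + 1, state.thought_state), reward, True


def act(state):
    """Environment action chosen by the policy, conditioned on the thought state."""
    return "right" if state.thought_state == 1 else "left"


def rollout(think_first, gamma=0.9):
    """Discounted return with or without one thought action before acting."""
    state, ret, discount = AgentState(0, 0), 0.0, 1.0
    if think_first:
        state, reward, _ = step(state, THINK)
        ret += discount * reward
        discount *= gamma  # assumption: thinking costs one time step
    _, reward, _ = step(state, act(state))
    return ret + discount * reward
```

In this toy setting, `rollout(think_first=True)` returns a higher discounted value than `rollout(think_first=False)`, even though the thought action itself earns nothing and never touches `env_state`. That is the sense in which a thought action "leads the agent to take a different, higher-reward course of action."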

The central theoretical result concerns the conditions under which thinking emerges. Under this formalization, thinking can be viewed as selecting between a set of sub-policies already contained in the agent's policy function. Thought actions are interpretable as the agent choosing to run one or more steps of policy improvement before continuing to act. This means thinking doesn't require new capabilities; it requires a policy initialization rich enough to contain multiple sub-policies worth selecting between.
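The selection view can be illustrated with a small sketch. This is an assumption-laden toy rather than the paper's construction: `SUB_POLICIES`, `REWARD`, `discounted_value`, and `select_thought_state` are hypothetical names, the rewards are invented, and switching sub-policy is assumed to cost exactly one discounted step.

```python
# Hypothetical toy: one policy table holding two sub-policies,
# indexed by thought state. Rewards are invented for illustration.
SUB_POLICIES = {0: "left", 1: "right"}   # thought_state -> environment action
REWARD = {"left": 0.2, "right": 1.0}     # one-step reward of each action


def discounted_value(thought_state, gamma, thinking_steps):
    """Return of acting under a sub-policy after `thinking_steps` thought actions."""
    return gamma ** thinking_steps * REWARD[SUB_POLICIES[thought_state]]


def select_thought_state(current, gamma=0.9):
    """Greedy choice between keeping the current sub-policy (no thinking)
    and switching to another one (assumed to cost one thought action)."""
    values = {
        t: discounted_value(t, gamma, 0 if t == current else 1)
        for t in SUB_POLICIES
    }
    return max(values, key=values.get)
```

With the default `gamma=0.9`, an agent starting in thought state 0 prefers to think and switch to sub-policy 1; with a steep discount such as `gamma=0.1`, the delay outweighs the gain and it keeps sub-policy 0. Nothing new is learned in either case: the greedy choice only picks among behaviours the policy table already contains, which is the sense in which a thought action resembles a policy improvement step.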

This reframes DeepSeek-R1's "aha moment" and similar findings. The thinking tokens that emerge during RL training aren't building new reasoning capabilities from scratch. Instead, the model is learning to select which existing sub-policy to deploy. The rich policy initialization from pre-training provides the raw material; RL provides the selection pressure.

The connection to existing insights is tight. For the note "Does RL teach reasoning or just when to use it?", the thought MDP provides the formal mechanism: "when to activate" is the same question as "which sub-policy to select." For the note "Can models learn when to think versus respond quickly?", the thought MDP explains why this works: the model is learning a meta-policy over its own sub-policies.

The deeper philosophical implication is that thinking is not a unitary capability but a structural property that emerges when the right conditions are met: a rich enough policy space, a selection mechanism (RL), and a task structure where delayed action (thinking first) is rewarded. LLMs instantiate these conditions because pre-training provides the policy richness and RL provides the selection pressure.

Original note title

thinking emerges under model-free rl when policy initialization provides sub-policies to select between