SYNTHESIS NOTE

Does RL training follow a predictable two-phase learning sequence?

This note explores whether reinforcement learning passes through consistent phases in which basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.

Synthesis note · 2026-02-22 · sourced from Reinforcement Learning

Across eight text-only and vision-language models, RL training reveals a consistent two-phase dynamic. In the first phase, the learning bottleneck is procedural correctness: a single calculation error invalidates an entire solution, creating a powerful gradient signal that compels mastery of low-level execution tokens (arithmetic, variable substitution, formula application). In the second phase, the bottleneck shifts to strategic planning: exploring and mastering high-level planning tokens (deduction like "we can use the fact that," branching like "let's try a different approach," backtracing like "but the problem mentions that").
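The planning/execution split can be made concrete with a small tagger. The sketch below is only an illustration: it labels sentences by matching the three example phrases quoted above, and the names `PLANNING_CUES` and `tag_sentences` are hypothetical. It is not the procedure the underlying work uses to identify planning tokens, which this note does not specify.

```python
import re

# Cue phrases are the examples quoted in this note. A real analysis would need
# a much larger inventory; this short list is an illustrative assumption.
PLANNING_CUES = {
    "deduction": ["we can use the fact that"],
    "branching": ["let's try a different approach"],
    "backtracing": ["but the problem mentions that"],
}


def tag_sentences(trace: str) -> list[tuple[str, str]]:
    """Label each sentence as a planning type or as 'execution'."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", trace) if s.strip()]
    tagged = []
    for sentence in sentences:
        lowered = sentence.lower()
        label = "execution"  # default when no planning cue matches
        for kind, cues in PLANNING_CUES.items():
            if any(cue in lowered for cue in cues):
                label = kind
                break
        tagged.append((sentence, label))
    return tagged
```

Phrase matching like this will miss paraphrased planning moves, so it is best read as a way to fix ideas rather than as a measurement tool.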

The phases are not mutually exclusive. Procedural refinement continues throughout training. But the primary driver of marginal performance gains shifts to strategic planning. This is why the "aha moment" phenomenon appears when it does — it represents the discovery and internalization of high-level reasoning strategies, which only becomes the active learning frontier after procedural skills are consolidated.

The entropy dynamics tell the same story. Planning tokens show increasing strategic diversification over training — the model explores new ways to combine established skills. Execution tokens show stable conditional entropy — once arithmetic is mastered, there's little incentive to find diverse ways to perform it. The performance improvement comes from discovering new combinations of established skills, which is the core function of planning.
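One simple way to look at this contrast is to average per-position entropy separately over planning and execution tokens. The sketch below assumes the next-token distributions and a planning mask are already available; `token_entropy` and `mean_entropy_by_type` are hypothetical helper names. Per-token Shannon entropy is used here only as a proxy, since this note does not specify how strategic diversification is measured in the source work.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_type(distributions, is_planning):
    """Average entropy separately over planning and execution positions.

    distributions: one probability list per token position.
    is_planning:   one bool per position (True marks a planning token).
    """
    totals = {"planning": [0.0, 0], "execution": [0.0, 0]}
    for probs, planning in zip(distributions, is_planning):
        key = "planning" if planning else "execution"
        totals[key][0] += token_entropy(probs)
        totals[key][1] += 1
    # Report 0.0 for a token type that never occurs in the sequence.
    return {k: (s / n if n else 0.0) for k, (s, n) in totals.items()}
```

Tracking these two averages across checkpoints would show whether the execution curve flattens while the planning curve keeps moving, which is the pattern the paragraph above describes.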

This insight exposes a core inefficiency in algorithms like GRPO that apply optimization pressure uniformly across all tokens. If the learning frontier is in planning tokens but gradient signal is diluted across execution tokens, optimization is wasteful. HICRA addresses this by concentrating optimization on planning tokens, achieving significant performance gains.
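The difference between uniform and concentrated pressure can be sketched at the level of per-token advantages. In the code below, `group_relative_advantages` standardizes rewards within a group of rollouts in the style of GRPO, and `token_advantages` broadcasts one rollout's advantage to its tokens. The `alpha` boost on planning positions is an assumed weighting rule chosen for illustration; this note does not state the exact form HICRA uses, and both function names are hypothetical.

```python
def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    # Small epsilon avoids division by zero when all rewards are equal.
    return [(r - mean) / (std + 1e-8) for r in rewards]


def token_advantages(advantage, is_planning, alpha=0.0):
    """Broadcast one rollout's advantage to its tokens.

    alpha = 0 gives every token the same advantage (uniform pressure).
    alpha > 0 adds alpha * |advantage| at planning positions only
    (an assumed rule, used here purely for illustration).
    """
    return [
        advantage + alpha * abs(advantage) if planning else advantage
        for planning in is_planning
    ]
```

With `alpha=0.2`, a rollout with advantage 1.0 gives its planning tokens 1.2 and its execution tokens 1.0, while a rollout with advantage -1.0 gives its planning tokens -0.8 and its execution tokens -1.0. Under this assumed rule, planning tokens are rewarded more in successful rollouts and penalized less in failed ones.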

The connection to existing insights is illuminating. The note "Which sentences actually steer a reasoning trace?" finds that certain sentences have outsized influence on the final answer; HICRA's planning tokens are likely the same phenomenon, identified there from a mechanistic perspective. The two-phase dynamic also helps explain the findings in "Do reasoning cycles in hidden states reveal aha moments?": the graph structure reflects the transition from procedural execution (local structure) to strategic planning (global topology).

Inquiring lines that read this note 120

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How can AI agents autonomously learn and transfer skills across tasks?

Which AI interaction patterns preserve learning while which ones degrade skill formation?

Can self-supervised signals enable process supervision without human annotation?

Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?

How do training priors constrain what context information can override?

How does in-context learning trigger phase transitions in model behavior?

Why do LLM chatbots fail as independent therapeutic agents?

Does therapy environment difficulty calibration affect RL policy learning quality?

What constrains reinforcement learning's ability to expand model reasoning?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Does decoupling planning from execution improve multi-step reasoning accuracy?

What structural advantages do diffusion language models offer over autoregressive methods?

Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?

How do self-generated feedback mechanisms enable effective model learning?

What determines success in training models on multiple tasks?

How do policy learning algorithm choices affect multi-objective optimization stability?

Why do zero-advantage rollouts destabilize training beyond just wasting compute?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does AI adoption affect human skill development and labor equality?

Does constraining AI access during early task phases preserve skill formation?

Does reinforcement learning teach reasoning or just when to reason?

What properties determine whether reward signals teach genuine reasoning?

Do outcome-only reward signals miss step-level errors that compound later?

What makes weaker teacher models effective for stronger student training?

How can weak-to-strong progressive training target planning without interfering with grounding?

Do base models contain latent reasoning that training can unlock?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Why do multi-turn conversations degrade AI intent and coherence?

How does single-turn training undermine multi-turn strategic dialogue?

How should conversational agents balance goal-driven initiative with user control?

Can hierarchical reinforcement learning manage phase-dependent initiative switching in dialogue?

How does latent reasoning compare to verbalized chain-of-thought?

Why do reward structures fail to shape long-term agent learning?

How can AI systems learn from failures without cascading errors?

How does sliding the start state backward create informative learning signals?

Can alternative training methods improve on supervised fine-tuning for language models?

Can continuous spectrum training outperform sequential SFT-then-RL stages?

How does memorization interact with learning and generalization?

How do out-of-distribution tests reveal that optimization learning is memorization?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

Why do agents confidently report success despite actually failing tasks?

What training objectives could reduce completion bias in autonomous agents?

How should agents balance memory condensation to optimize context efficiency?

How does memory folding enable agents to reconsider strategies mid-task?

What memory abstraction level best enables agent knowledge reuse?

What distinguishes working memory from strategic memory in agent task execution?

Can AI systems balance emotional competence with factual reliability?

How does curriculum learning prevent instability in social-emotional RL training?

How does example difficulty affect learning efficiency in language models?

How does the optimal difficulty band shift as the model's capabilities improve during training?

Can language model RL training avoid reward hacking and misalignment?

How can process reward models supervise complex reasoning traces?

How does process-based reward differ from outcome-only reward in training?

Related concepts in this collection 6

Which sentences actually steer a reasoning trace? Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
converges: planning tokens in HICRA likely correspond to thought anchors
Do reasoning cycles in hidden states reveal aha moments? What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
extends: the two-phase dynamic explains how graph topology evolves during training
Does policy entropy collapse limit reasoning performance in RL? As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
reframes: entropy collapse may be acceptable for execution tokens but catastrophic for planning tokens
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
deepens: the "when" is specifically about planning tokens; execution tokens are "how"
What happens inside models when they suddenly generalize? Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
analogous phased development: grokking's memorization-then-circuit-formation parallels the procedural-then-strategic progression; both show that generalization requires passing through a consolidation phase before higher-order structure emerges
Can language modeling close the knowing-doing gap in AI? Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
TiG operates on the same procedural-vs-strategic axis HICRA identifies, but at the architectural level: language-as-policy refined by RL preserves declarative reasoning while building procedural competence — HICRA's two-phase dynamic predicts the order TiG observes during training
