Does RL training follow a predictable two-phase learning sequence?
This note explores whether reinforcement learning exhibits consistent phases in which basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
Across eight text-only and vision-language models, RL training reveals a consistent two-phase dynamic. In the first phase, the learning bottleneck is procedural correctness: a single calculation error invalidates an entire solution, creating a strong gradient signal that compels mastery of low-level execution tokens (arithmetic, variable substitution, formula application). In the second phase, the bottleneck shifts to strategic planning: exploring and mastering high-level planning tokens (deduction like "we can use the fact that," branching like "let's try a different approach," backtracking like "but the problem mentions that").
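To make the execution/planning distinction concrete, here is a minimal sketch that labels the steps of a reasoning trace by matching the cue phrases quoted above. The phrase list, the `tag_steps` helper, and the example trace are all illustrative assumptions; this is not the procedure used to identify planning tokens in the underlying work.

```python
# Hypothetical cue phrases, taken from the examples quoted in this note.
PLANNING_PHRASES = (
    "we can use the fact that",        # deduction
    "let's try a different approach",  # branching
    "but the problem mentions that",   # backtracking
)

def tag_steps(steps):
    """Label each step 'planning' if it contains a cue phrase, else 'execution'."""
    labels = []
    for step in steps:
        lowered = step.lower()
        is_planning = any(phrase in lowered for phrase in PLANNING_PHRASES)
        labels.append("planning" if is_planning else "execution")
    return labels

# Invented example trace: two planning steps around one arithmetic step.
trace = [
    "We can use the fact that the triangle is isosceles.",
    "So 2x + 40 = 180, which gives x = 70.",
    "But the problem mentions that x is the base angle.",
]
labels = tag_steps(trace)
```

A phrase match is a crude proxy: it misses planning expressed in other words, so any real analysis would need a more robust way to identify strategic text.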
The phases are not mutually exclusive. Procedural refinement continues throughout training. But the primary driver of marginal performance gains shifts to strategic planning. This is why the "aha moment" phenomenon appears when it does — it represents the discovery and internalization of high-level reasoning strategies, which only becomes the active learning frontier after procedural skills are consolidated.
The entropy dynamics tell the same story. Planning tokens show increasing strategic diversification over training — the model explores new ways to combine established skills. Execution tokens show stable conditional entropy — once arithmetic is mastered, there's little incentive to find diverse ways to perform it. The performance improvement comes from discovering new combinations of established skills, which is the core function of planning.
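The contrast between diversifying planning tokens and stable execution tokens can be illustrated with a small entropy calculation. The sketch below computes the Shannon entropy of an empirical distribution over sampled phrases at two checkpoints. The phrase samples are invented for illustration and `empirical_entropy` is a hypothetical helper; it is a simplified stand-in for the entropy measures the note refers to, not a reproduction of them.

```python
import math
from collections import Counter

def empirical_entropy(items):
    """Shannon entropy in bits of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Invented samples of planning phrases from rollouts at two checkpoints.
early_planning = ["let's try a different approach"] * 6 + ["we can use the fact that"] * 2
late_planning = [
    "let's try a different approach",
    "we can use the fact that",
    "but the problem mentions that",
    "consider the symmetric case",
] * 2

# Invented samples of execution steps: same distribution at both checkpoints.
early_execution = ["substitute and simplify"] * 6 + ["apply the formula"] * 2
late_execution = ["substitute and simplify"] * 6 + ["apply the formula"] * 2

planning_change = empirical_entropy(late_planning) - empirical_entropy(early_planning)
execution_change = empirical_entropy(late_execution) - empirical_entropy(early_execution)
```

In this toy data, `planning_change` is positive while `execution_change` is zero, which is the qualitative pattern the paragraph above describes.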
This insight exposes a core inefficiency in algorithms like GRPO, which apply optimization pressure uniformly across all tokens. If the learning frontier lies in planning tokens but the gradient signal is diluted across execution tokens, much of the optimization effort is wasted. HICRA (Hierarchy-Aware Credit Assignment) addresses this by concentrating optimization on planning tokens, and is reported to achieve significant performance gains.
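The difference between uniform and concentrated credit assignment can be sketched in a few lines. The first function below computes group-normalized advantages in the style of GRPO. The second copies a rollout's advantage onto every token and optionally boosts it on planning tokens. The boost rule `A + alpha * |A|`, the `alpha` parameter, and the function names are assumptions made for illustration, not a confirmed statement of HICRA's update.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def token_advantages(advantage, planning_mask, alpha=0.0):
    """Spread one rollout-level advantage over its tokens.

    alpha = 0 gives uniform credit. With alpha > 0 (an assumed rule),
    planning tokens receive advantage + alpha * |advantage|.
    """
    return [
        advantage + alpha * abs(advantage) if is_planning else advantage
        for is_planning in planning_mask
    ]

# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# One rollout with three tokens, of which only the first is a planning token.
mask = [True, False, False]
uniform = token_advantages(advantages[0], mask)             # same value on every token
focused = token_advantages(advantages[0], mask, alpha=0.5)  # larger on the planning token
```

Under this assumed rule, a planning token in a successful rollout is reinforced more strongly than its execution tokens, and a planning token in a failed rollout is penalized less, which keeps exploration of strategies alive while execution tokens are treated as before.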
The connections to existing notes are illuminating. Which sentences actually steer a reasoning trace? finds that certain sentences have outsized influence on the final answer, so HICRA's planning tokens are likely the same phenomenon as those thought anchors, identified there from a mechanistic perspective. The two-phase dynamic also bears on Do reasoning cycles in hidden states reveal aha moments?: the graph structure reflects the transition from procedural execution (local structure) to strategic planning (global topology).
Inquiring lines that use this note as a source 111
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Which AI interaction patterns preserve learning, and which degrade skill formation?
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- How does in-context learning trigger phase transitions in model behavior?
- Does therapy environment difficulty calibration affect RL policy learning quality?
- What behavioral changes occur during reward learning training?
- How does entropy collapse in reinforcement learning differ from entropy maintenance in graph reasoning?
- Why must procedural skills consolidate before strategic reasoning can develop?
- Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?
- What makes bilevel metacognition architectural rather than emergent in current systems?
- How do training objectives shape what a world model actually learns?
- Does task ordering affect multi-task reinforcement learning outcomes?
- How do developmental curriculums emerge from learning progress signals?
- Why do zero-advantage rollouts destabilize training beyond just wasting compute?
- How should guidance levels adapt as the model's capability boundary shifts?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- Why does early experience provide better warm-starts for downstream reinforcement learning?
- Does constraining AI access during early task phases preserve skill formation?
- Can energy minimization replace reasoning-specific reinforcement learning for system 2 thinking?
- Do outcome-only reward signals miss step-level errors that compound later?
- Can in-context learning replicate the timing effects that RL teaches models?
- Does policy entropy collapse represent the main bottleneck in reasoning-focused RL scaling?
- How can weak-to-strong progressive training target planning without interfering with grounding?
- What capabilities actually require massive scale versus specialized training regimes?
- Can meta-reinforcement learning explain why this bias pattern emerges rationally?
- Do emergent abilities result from genuine new capabilities or implicit in-context learning?
- Can RL teach when to use reasoning versus when to respond directly?
- How does dual-rate learning separate episodic and procedural memory in neural networks?
- How does single-turn training undermine multi-turn strategic dialogue?
- Does reinforcement learning learn optimal per-turn reasoning discipline?
- Can hierarchical reinforcement learning manage phase-dependent initiative switching in dialogue?
- Does policy entropy collapse limit how many iterations of reasoning training work?
- Do depth thresholds correspond to transitions between procedural and strategic learning?
- How do residual connections and layer norm stabilize training in deep RL?
- How do evaluative versus directive signals differ in next-state training?
- Can architectural changes like decoupling intent understanding help overcome next-turn reward limitations?
- Why do next-turn reward objectives fail to encourage multi-turn goal progress?
- How does policy entropy during training affect search discipline during inference?
- How does sliding the start state backward create informative learning signals?
- How does reinforcement learning differ from chain-of-thought distillation?
- Does RL refine existing knowledge or discover entirely new capabilities?
- How does RL compress reasoning path diversity during training?
- What limits RL's ability to scale for reasoning at training time?
- Do thought anchors correspond mechanistically to planning tokens in RL?
- Why does policy entropy collapse predict sigmoid saturation points?
- Which recipe choices determine the asymptotic ceiling in RL training?
- Can RL training teach models when to activate reasoning versus when to skip it?
- What happens to model reasoning when policy entropy collapses during RL?
- Why do high entropy tokens carry most of the learning signal in RL?
- How does next-turn reward optimization contribute to agent passivity?
- How does behavior cloning reduce complexity before RL training in rerankers?
- Why do models follow a two-phase pattern of procedural then strategic learning?
- What distinguishes RL that creates new capabilities from RL that merely teaches timing?
- Can models learn both what and how to study through reinforcement learning?
- How does temporal anchoring maintain the learning signal in self-rewarding loops?
- How does representational convergence differ from policy entropy collapse in iterative training?
- Can continuous spectrum training outperform sequential SFT-then-RL stages?
- Does RL training actually restore the critical thinking that reasoning models lose?
- What makes Effective Rank Acceleration a stable training signal for dual-channel incentives?
- What separates bootstrapping gains from sustained self-improvement gains?
- Does format-based pretraining determine how models respond to reinforcement learning?
- How does trajectory burstiness compare to other structural properties that shape emergent capabilities?
- How does reinforcement learning on outcomes reinforce template-matching rather than computation?
- Can out-of-distribution tests expose memorization in reinforcement learning fine-tuned models?
- Why does prolonged RL discover strategies absent from any base model sample?
- How does policy initialization with sub-policies enable emergent thinking?
- Why does imitation learning alone plateau without outcome-based refinement?
- How do out-of-distribution tests reveal that optimization learning is memorization?
- Can reinforcement learning fix the reasoning gaps that supervised fine-tuning misses?
- What structural differences emerge between early generic skills and later meta-strategy skills?
- How do complete multi-turn trajectories differ from isolated task examples?
- What training objectives could reduce completion bias in autonomous agents?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- How do reward signals in RLVR interact with pretraining biases?
- How does post-training shift models from passive prediction to on-policy action?
- How do high-entropy tokens concentrate reinforcement learning's effect?
- How does memory folding enable agents to reconsider strategies mid-task?
- Does reinforcement learning teach models how to reason or when to reason?
- Does grokking in modular arithmetic follow the same three-phase learning trajectory?
- Does RL training activate latent meta-learning capacity or create it from scratch?
- Why does decoupling planning from execution improve over sequential interleaving?
- What scaling properties emerge from RL training dynamics beyond verification?
- What distinguishes working memory from strategic memory in agent task execution?
- How does curriculum learning prevent instability in social-emotional RL training?
- How does on-policy entropy recognition differ from training-time entropy collapse?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- How should multi-objective post-training balance competing behavioral goals?
- What training duration is actually needed for RL to expand capabilities?
- Does RL primarily teach when to use reasoning or how to reason?
- What training interventions could close the perception-action gap?
- What does RL post-training actually teach reasoning systems?
- How do complementary learning systems explain the need for fast and slow consolidation?
- Why does policy entropy collapse when scaling RL for reasoning?
- What makes supervised fine-tuning worsen RL exploration later?
- What capacity threshold determines whether RL teaches activation versus shortcut learning?
- Can entropy regularization or critique models prevent search strategy collapse during RL training?
- Can early experience replace external rewards as a learning signal?
- How does stage-wise training scheduling resolve conflicts between constraint-following and creative tasks?
- Does the productive difficulty band ever stabilize during training?
- How does the optimal difficulty band shift as the model's capabilities improve during training?
- Does RL training redirect self-doubt into productive gap analysis?
- What causes policy entropy collapse in reasoning-focused reinforcement learning?
- How does pretraining determine what RL can later teach a model?
- When does reinforcement learning actually produce true reasoning gains in models?
- Can training order and structure shape what networks retain and learn?
- How does model scale affect anticipatory behavior in structured training?
- How does early commitment in reasoning differ from early exploitation in planning?
- Does targeting the edge of competence during RL pretraining unlock true reasoning gains?
- Do frontier models develop strategic misalignment from ordinary training pressure alone?
- How does process-based reward differ from outcome-only reward in training?
- What makes exploration a verifiable and measurable training objective?
- What makes advantage shaping more stable than reward shaping for tool training?
Related concepts in this collection 6
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  converges: planning tokens in HICRA likely correspond to thought anchors
- Do reasoning cycles in hidden states reveal aha moments?
  What if the internal loops in model reasoning, visible in hidden-state topology, correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
  extends: the two-phase dynamic explains how graph topology evolves during training
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  reframes: entropy collapse may be acceptable for execution tokens but catastrophic for planning tokens
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  deepens: the "when" is specifically about planning tokens; execution tokens are "how"
- What happens inside models when they suddenly generalize?
  Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
  analogous phased development: grokking's memorization-then-circuit-formation parallels the procedural-then-strategic progression; both show that generalization requires passing through a consolidation phase before higher-order structure emerges
- Can language modeling close the knowing-doing gap in AI?
  Current LLMs reason well but act poorly in interactive tasks, while RL agents act well but cannot explain themselves. Can reformulating decision-making as language modeling with environmental feedback bridge this fundamental split?
  TiG operates on the same procedural-vs-strategic axis HICRA identifies, but at the architectural level: language-as-policy refined by RL preserves declarative reasoning while building procedural competence, and HICRA's two-phase dynamic predicts the order TiG observes during training
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Emergent Hierarchical Reasoning In LLMs Through Reinforcement Learning
- Reinforcement Learning with Rubric Anchors
- The Art of Scaling Reinforcement Learning Compute for LLMs
- From Trial-and-Error to Improvement: A Systematic Analysis of LLM Exploration Mechanisms in RLVR
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Can Large Language Models Develop Strategic Reasoning? Post-training Insights from Learning Chess
- Teaching Large Language Models to Reason with Reinforcement Learning
- The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
Original note title
rl training exhibits a two-phase dynamic where procedural consolidation precedes strategic planning exploration