INQUIRING LINE

Why do models follow a two-phase pattern of procedural then strategic learning?

This explores why reinforcement learning models tend to master execution mechanics first and only later optimize higher-level planning — and what makes that ordering show up so reliably.


RL models seem to learn in two stages: they get the procedure right before they get the strategy right. The most direct evidence is that this isn't a quirk of one model. Across eight models, RL training reliably shows a first phase where execution correctness is the bottleneck, then a second phase where strategic planning becomes the thing worth optimizing. You can even watch it in the numbers: entropy on planning tokens keeps rising while execution entropy settles, and concentrating optimization on those planning tokens is where the late gains come from (see "Does RL training follow a predictable two-phase learning sequence?"). The ordering looks less like a training schedule and more like a dependency: you can't fruitfully explore strategy until the moves you'd execute are reliable.
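The entropy comparison can be made concrete with a small sketch. It assumes each token position already carries a role label ("planning" or "execution"); how tokens get labeled is not described here, so the labels, the function names, and the toy distributions below are all hypothetical.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(distributions, roles):
    """Average per-token entropy separately for each role label.

    distributions: one next-token probability list per position.
    roles: one label per position (hypothetical tags, e.g. "planning").
    """
    if len(distributions) != len(roles):
        raise ValueError("need exactly one role label per token position")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for probs, role in zip(distributions, roles):
        totals[role] += token_entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}


# Toy data (made up): planning positions are uncertain,
# execution positions are fully determined.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # planning: uniform over four options
    [1.0, 0.0, 0.0, 0.0],      # execution: deterministic
    [0.5, 0.5, 0.0, 0.0],      # planning: two live options
    [1.0, 0.0, 0.0, 0.0],      # execution: deterministic
]
roles = ["planning", "execution", "planning", "execution"]
entropy_by_role = mean_entropy_by_role(dists, roles)
print(entropy_by_role)
```

Tracking these two averages across training checkpoints is one way to look for the pattern described above: execution entropy flattening while planning entropy continues to climb.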

Why procedural first? Because procedure is the more transferable, more broadly supported kind of knowledge to begin with. Analysis of 5 million pretraining documents shows reasoning leans on broad procedural patterns drawn from many sources, while factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). A model arrives at RL already carrying procedural scaffolding, so the cheapest early wins come from consolidating and sharpening execution it can already half-do, before it has any stable base to plan over.

There's a deeper reason the phases can't easily collapse into one: RL on verifiable rewards mostly activates and reweights strategies the model already has rather than installing new ones (see "What does reward learning actually do to model reasoning?"). A single example can trigger activation, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones. That suggests early training is essentially surfacing latent procedural competence, and only once that has stabilized does the harder work of choosing *which* competence to deploy (the strategic layer) become the live constraint. A similar shape recurs when you add supervision. SFT-then-RL runs through a shift-readapt-overfit progression in which the model first absorbs expert procedure, then must be steered to keep exploring (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"). Step-wise expert-similarity rewards, likewise, work best precisely as a *curriculum foundation*: dense procedural signal first, outcome-based strategic refinement after (see "Can step-wise expert rewards help small models learn hard reasoning?").
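To illustrate why a step-wise reward gives denser signal than an outcome-only one, here is a minimal sketch. The similarity measure is an assumption: it uses Python's standard `difflib.SequenceMatcher` ratio, and the function names and example steps are hypothetical rather than taken from the cited work.

```python
from difflib import SequenceMatcher


def step_similarity(model_step, expert_step):
    """Similarity in [0, 1] between two step strings (assumed measure)."""
    return SequenceMatcher(None, model_step, expert_step).ratio()


def stepwise_rewards(model_steps, expert_steps):
    """One reward per expert step; a missing model step scores 0."""
    rewards = []
    for i, expert in enumerate(expert_steps):
        model = model_steps[i] if i < len(model_steps) else ""
        rewards.append(step_similarity(model, expert))
    return rewards


def outcome_reward(model_answer, expert_answer):
    """Sparse alternative: 1 only if the final answer matches exactly."""
    return 1.0 if model_answer == expert_answer else 0.0


# Hypothetical rollout: first step right, second step wrong, third missing.
expert = ["x = 2 + 3", "x = 5", "answer: 5"]
model = ["x = 2 + 3", "x = 6"]

dense = stepwise_rewards(model, expert)
sparse = outcome_reward("6", "5")
print(dense, sparse)
```

In this toy rollout the outcome reward is zero, so it carries no gradient signal, while the step-wise rewards still credit the correct first step and partially credit the second. That is the sense in which dense procedural signal can serve as a foundation before outcome-based refinement.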

The corpus also hints at what the strategic phase is actually solving. Once execution is solid, the remaining failures look like planning failures: models abandon reasoning paths mid-exploration and switch ideas too soon, a problem that a simple decoding-time penalty on thought-transition tokens mitigates without any fine-tuning (see "Do reasoning models switch between ideas too frequently?"). And the strategic layer isn't one thing: different models settle into distinct reasoning styles (minimax, trust-based, belief-anticipation), with performance tied to the structure of the problem (see "Do large language models use one reasoning style or many?"). That diversity is exactly what you'd expect to emerge in a second phase: strategy is where models can differ, because procedure is where they had to converge.
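A decoding-time penalty on thought-transition tokens can be sketched in a few lines. This is an illustration under stated assumptions, not the cited method's exact formulation: the penalty size, the window length, the toy vocabulary, and the function names are all hypothetical.

```python
def penalize_transitions(logits, transition_ids, penalty, steps_in_thought, window):
    """Return a copy of logits with transition tokens down-weighted.

    The penalty applies only while the current thought is younger than
    `window` tokens, so switching stays possible once a thought has
    been given room to develop.
    """
    adjusted = list(logits)
    if steps_in_thought < window:
        for tok in transition_ids:
            adjusted[tok] -= penalty
    return adjusted


def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary (hypothetical): 0 = "so", 1 = "Alternatively", 2 = "therefore"
logits = [1.0, 1.5, 0.8]
transition_ids = {1}  # tokens that signal a switch to a new idea

early = greedy_pick(penalize_transitions(logits, transition_ids, 3.0, 2, 5))
late = greedy_pick(penalize_transitions(logits, transition_ids, 3.0, 10, 5))
print(early, late)
```

Early in a thought the switch token is suppressed and decoding continues the current line of reasoning; after the window has passed, the original logits apply and the model is free to change approach. No model weights are touched, which matches the decoding-only character of the fix described above.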

The thing worth carrying away: the two-phase pattern probably isn't something training *imposes* — it's a consequence of what RL can and can't do. If reward mostly activates existing capability, then learning has to bottom out on the broadly-supported procedural skills first and only then move the bottleneck up to the narrower, model-specific business of strategy. The phase boundary is the moment execution stops being the scarce resource.


Sources (7 notes)

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Why does SFT-then-RL training follow a predictable three-phase pattern?

CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating claims about two-phase learning (procedural → strategic) in RL-trained LLMs. The question remains: *why* do models exhibit this ordering, and is it fundamental or contingent on current training regimes?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024 to May 2026. Key constraints identified:
- Execution correctness is the early bottleneck; planning entropy keeps rising *while* execution entropy settles (~2025-01, arXiv:2501.18585).
- RL on verifiable rewards mostly *activates* latent procedural skill rather than installing new strategies; single examples trigger activation (~2025-07, arXiv:2507.14843).
- Procedural knowledge is broadly supported in pretraining (many sources); factual recall is narrow and document-specific (~2024-11, arXiv:2411.12580).
- SFT-then-RL exhibits shift-readapt-overfit; step-wise expert-similarity rewards work best as *curriculum foundation* (~2025-10, arXiv:2510.25992).
- Strategic phase is not monolithic: models settle into distinct reasoning styles (minimax, trust-based, belief-anticipation) by problem structure (~2025-02, arXiv:2502.20432).

Anchor papers (verify; mind their dates):
- arXiv:2507.14843 (Jul 2025): RLVR dynamics and activation limits.
- arXiv:2501.18585 (Jan 2025): Underthinking as premature thought-switching.
- arXiv:2411.12580 (Nov 2024): Procedural vs. factual knowledge in pretraining.
- arXiv:2510.25992 (Oct 2025): SFT-then-RL curriculum progression.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above—especially the claim that RL mostly activates existing procedure—judge whether newer model scale, instruction-tuning methods, multi-agent scaffolding, or test-time compute (sampling, tree search, process rewards) have since *collapsed* the two phases or *inverted* their order. Separate the durable question (why this ordering if it still holds?) from the perishable limitation (is it an artifact of supervised RL, or fundamental?). Cite what changed it.
(2) Surface the strongest *disagreement* or *superseding* work from the last ~6 months. Does any paper claim procedure and strategy *co-emerge*, or that strategic learning can *precede* procedural consolidation under different reward structures?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can curriculum-free, outcome-only RL collapse the phases? (b) Do scaling laws or architectural changes (e.g., mixture-of-experts, dynamic routing) alter the bottleneck order?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
