Why does RL behavior differ between standard reasoning tasks and complex planning domains?
This explores why reinforcement learning tends to work cleanly on standard reasoning (math, short chains) but behaves differently — often worse, or in two distinct phases — once a task requires real planning, and what the corpus says is mechanically driving that gap.
The corpus suggests the answer is that these two regimes stress different bottlenecks, and RL's core mechanism helps with only one of them. The cleanest framing comes from work showing that RL training unfolds in two phases: a first phase where getting execution right drives the gains, and a second phase where strategic planning becomes the actual bottleneck (see "Does RL training follow a predictable two-phase learning sequence?"). Standard reasoning tasks mostly live in that first phase: the steps are short, correct execution is what's scarce, and RL sharpens it efficiently. Planning domains push you into the second phase, where the hard part isn't executing a step but deciding which branch to explore at all, and that's where RL's behavior changes character.
The deeper reason is what RL actually does to a model's distribution. Several notes converge on the finding that RL doesn't teach new reasoning: it activates strategies already learned in pretraining and optimizes *when* to deploy them, improving sampling efficiency within existing capability boundaries (see "What does reward learning actually do to model reasoning?" and "Does RL post-training create reasoning or just deploy it?"). It does this by concentrating probability mass on reward-maximizing trajectories, which is exactly what you want when there's one right execution path and exactly what you *don't* want in planning, where you need to keep many candidate branches alive. That concentration shows up as entropy collapse: RL squeezes behavioral diversity, and the same mechanism documented in reasoning reappears in search agents, narrowing exploration (see "Does reinforcement learning squeeze exploration diversity in search agents?"). For a short reasoning chain that narrowing is harmless or helpful; for a deep planning problem it's corrosive.
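To make the "concentrating probability mass" idea concrete, here is a minimal sketch. It does not implement any RL algorithm from the notes. It assumes, purely for illustration, that the effect of sharpening can be mimicked by raising each probability to a power `beta` and renormalizing, and the four-option `policy` is a made-up toy distribution. The point is only that as mass concentrates on the already-favored option, Shannon entropy falls.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def sharpen(probs, beta):
    """Raise each probability to the power beta and renormalize.

    Assumption for illustration only: beta > 1 stands in for the
    concentrating effect described in the text. It is not an RL update.
    """
    weights = [p ** beta for p in probs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy policy over four candidate branches.
policy = [0.4, 0.3, 0.2, 0.1]

for beta in (1, 2, 4, 8):
    sharpened = sharpen(policy, beta)
    print(f"beta={beta}  entropy={entropy(sharpened):.3f}  "
          f"probs={[round(p, 3) for p in sharpened]}")
```

Running it shows entropy shrinking as `beta` grows while the top option absorbs most of the mass, which is the toy analogue of the narrowing described above.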
Why corrosive? Because planning failures are failures of *exploration*, not execution. Reasoning models that wander (exploring invalidly, or abandoning promising paths prematurely) see their success probability drop exponentially with problem depth, so medium problems stay solvable while deep ones become catastrophic (see "Why do reasoning LLMs fail at deeper problem solving?" and "Why do reasoning models abandon promising solution paths?"). An RL objective that rewards converging fast actively worsens this, because the thing planning needs is breadth and the thing RL supplies is concentration.
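The exponential drop with depth can be illustrated with a deliberately simple model. The sketch below assumes every step of a solution succeeds independently with the same probability, so a depth-d problem succeeds with probability per_step ** depth. Both the independence assumption and the 0.9 per-step value are hypothetical choices made for illustration, not figures from the notes.

```python
def success_probability(per_step: float, depth: int) -> float:
    """Chance of completing `depth` steps in a row.

    Assumes each step succeeds independently with probability `per_step`
    (an illustrative simplification, not a result from the corpus).
    """
    return per_step ** depth

# Hypothetical per-step reliability of 90%.
for depth in (5, 10, 20, 40):
    print(f"depth={depth:>2}  success={success_probability(0.9, depth):.4f}")
```

Even with a step that works nine times out of ten, the compounded success rate falls from a bit over a half at depth 5 to under 2% at depth 40, which matches the pattern of medium problems staying solvable while deep ones collapse.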
The corpus also has the constructive flip side, which is the most useful part for a curious reader: the fixes for planning domains are all about *protecting* exploration against RL's narrowing pull. Allocating compute to diverse abstractions instead of deeper single chains enforces a breadth-first search that depth-only reasoning can't manage (see "Can abstractions guide exploration better than depth alone?"). Separating the planner (decomposer) from the executor (solver) prevents the two from interfering. Notably, the decomposition skill transfers across domains while the solving skill doesn't, which tells you planning and execution really are different capabilities that RL shouldn't be collapsing together (see "Does separating planning from execution improve reasoning accuracy?"). Training order matters too: structured domains drive entropy *down* while open-ended ones drive it *up*, so scheduling structured tasks first keeps entropy collapse from damaging the exploratory capabilities you'll need later (see "Does training order reshape how models handle different task types?").
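One way to see why protecting breadth matters under a fixed sampling budget is a small coverage calculation. This is not the RLAD method or any technique from the notes; it assumes `k` independent samples and asks how likely it is that at least one lands on the correct branch. The two probabilities, `broad` and `narrow`, are hypothetical: they stand for the mass a diverse policy and a collapsed policy place on a branch that turns out to be the right one.

```python
def coverage(p_correct: float, k: int) -> float:
    """Probability that at least one of k independent samples picks the
    correct branch, given it has probability p_correct per sample."""
    return 1.0 - (1.0 - p_correct) ** k

# Hypothetical values, chosen only for illustration.
broad = 0.10   # a diverse policy still gives the correct branch 10% mass
narrow = 0.01  # a collapsed policy has squeezed it down to 1%

for k in (1, 8, 64):
    print(f"k={k:>2}  broad={coverage(broad, k):.3f}  "
          f"narrow={coverage(narrow, k):.3f}")
```

Under these assumptions the diverse policy almost certainly finds the branch within 64 samples, while the collapsed one still misses it more often than not. Extra samples only pay off if the policy has kept enough mass on the alternatives.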
The thing you may not have known you wanted to know: RL doesn't behave differently in planning because planning is 'harder' in some vague sense — it behaves differently because RL's one trick is sharpening a distribution, and planning is the one regime where sharpening the distribution is the wrong move. Standard reasoning rewards convergence; planning punishes it.
Sources (9 notes)
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.