How can weak-to-strong progressive training target planning without interfering with grounding?
This explores whether you can train a model's planning ability in escalating stages (weak-to-strong, or progressive) while leaving its grounding, meaning its ability to stay tied to real observations and actions, untouched. The corpus suggests the two capabilities are separable, but only if you stage and route training carefully.
The first thing the corpus says about pushing planning forward in steps without damaging grounding is that the two pull against each other by nature. AutoGLM's work on GUI agents found that planning and grounding have *opposing* optimization requirements, so bundling them in one policy means that improving one degrades the other. Their fix was an intermediate interface that lets each be developed independently before the two are composed back together (see: Why do planning and grounding pull against each other in agents?). That separation is the precondition for any progressive scheme: you can't 'target planning without interfering with grounding' until planning is something you can address on its own.
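To make the idea of an intermediate interface concrete, here is a minimal sketch. It is not AutoGLM's actual design: `ElementIntent`, `toy_planner`, `toy_grounder`, and `act` are hypothetical names, and the assumption is simply that the planner emits a textual description of the target while the grounder alone turns that description into screen coordinates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class ElementIntent:
    """Hypothetical intermediate interface: an action plus a description of its target."""
    action: str               # e.g. "click"
    target_description: str   # e.g. "submit button"


# The planner sees the goal and an observation summary, never coordinates.
Planner = Callable[[str, str], ElementIntent]
# The grounder sees the intent and the screen elements, never the goal.
Grounder = Callable[[ElementIntent, Dict[str, Tuple[int, int]]], Tuple[int, int]]


def toy_planner(goal: str, observation: str) -> ElementIntent:
    # Stand-in for a learned planning policy.
    if "form filled" in observation:
        return ElementIntent("click", "submit button")
    return ElementIntent("click", "name field")


def toy_grounder(intent: ElementIntent,
                 elements: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    # Stand-in for a learned grounding model: description -> coordinates.
    return elements[intent.target_description]


def act(planner: Planner, grounder: Grounder, goal: str, observation: str,
        elements: Dict[str, Tuple[int, int]]) -> Tuple[str, Tuple[int, int]]:
    """Compose the two parts; either one can be retrained or swapped on its own."""
    intent = planner(goal, observation)
    return intent.action, grounder(intent, elements)
```

Because the only thing crossing the boundary is an `ElementIntent`, a later planning-only training stage could update the planner while the grounder's parameters stay frozen, which is the property the paragraph above depends on.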
There's a striking finding that progression may already be the natural shape of training. Across eight models, RL training unfolds in two phases: first execution correctness (the grounded, procedural skill) is consolidated, and only then does strategic planning become the bottleneck, with planning-token entropy rising while execution entropy stabilizes (see: Does RL training follow a predictable two-phase learning sequence?). Concentrating optimization on planning tokens *after* grounding has settled is exactly the weak-to-strong recipe the question asks about, and it suggests the cleanest target is the planning tokens themselves once execution entropy has stopped moving.
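One way to read "concentrate optimization on planning tokens after execution has settled" is as a plateau check followed by a token-weighted policy-gradient loss. The sketch below is an illustration under assumptions, not the procedure from the cited note: the `is_planning` mask is assumed to be given (how planning tokens are identified is not specified here), and `execution_settled`, `window`, `tol`, and `execution_weight` are hypothetical names and parameters.

```python
from typing import Sequence


def execution_settled(entropy_history: Sequence[float],
                      window: int = 3, tol: float = 0.02) -> bool:
    """True when execution-token entropy has varied by at most `tol`
    over the last `window` checkpoints (a simple plateau heuristic)."""
    if len(entropy_history) < window:
        return False
    recent = entropy_history[-window:]
    return max(recent) - min(recent) <= tol


def planning_weighted_loss(logprobs: Sequence[float],
                           advantages: Sequence[float],
                           is_planning: Sequence[bool],
                           execution_weight: float = 0.0) -> float:
    """REINFORCE-style loss, -advantage * logprob, averaged over weighted tokens.

    Planning tokens get weight 1.0; execution tokens get `execution_weight`
    (0.0 means they contribute no gradient signal at all).
    """
    total, weight_sum = 0.0, 0.0
    for lp, adv, plan in zip(logprobs, advantages, is_planning):
        w = 1.0 if plan else execution_weight
        total += -w * adv * lp
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0
```

A training loop might use the ordinary unweighted loss until `execution_settled(...)` returns true and only then switch to `planning_weighted_loss`. Whether execution tokens should be fully masked (`execution_weight=0.0`) or merely down-weighted is a design choice this sketch leaves open.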
Order turns out to matter mechanically, not just conveniently. Omni-Thinker showed that structured tasks drive entropy *down* while open-ended tasks drive it *up*, and that scheduling structured tasks first (a backward-transfer-guided curriculum) beat joint training by preventing entropy collapse from wrecking the more exploratory capabilities (see: Does training order reshape how models handle different task types?). Read against the question, this is a warning: a progressive schedule that hammers grounded, structured skill too hard can collapse the very entropy that planning needs to explore. So 'without interfering' is partly about protecting headroom for the later stage.
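The entropy being protected here is the Shannon entropy of the model's next-token distribution. As a rough sketch of "protecting headroom", the code below computes that entropy and applies a simple entropy-floor guard that ends the structured stage early. The guard is an assumption introduced for illustration, not Omni-Thinker's BWT-guided curriculum, and `next_stage`, the stage names, and `entropy_floor` are hypothetical.

```python
import math
from typing import Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def next_stage(current_stage: str, mean_entropy: float, entropy_floor: float) -> str:
    """Leave the structured stage once mean entropy reaches the floor,
    so the open-ended (planning) stage still has room to explore."""
    if current_stage == "structured" and mean_entropy <= entropy_floor:
        return "open_ended"
    return current_stage
```

For example, `next_stage("structured", 0.3, 0.5)` returns `"open_ended"`, while the same call with a mean entropy of 0.9 stays in `"structured"`. A fixed floor is the crudest possible trigger; a real schedule would more likely combine it with a validation metric.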
Two more notes give you levers for touching planning lightly. TRELAWNEY embeds lookahead tokens carrying future information directly into the training data, improving planning and goal-conditioned generation with no architectural change and standard infrastructure: a data-side way to add planning without reshaping the grounded policy (see: Can embedding future information in training data improve planning?). ReAct, in turn, shows that the grounding side wants to stay interleaved with reasoning: alternating verbal reasoning with real tool queries injects external feedback that prevents the planner's errors from compounding (see: Can interleaving reasoning with real-world feedback prevent hallucination?). So planning gains shouldn't cut the model off from the observations that keep it honest.
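The data-side lever can be illustrated with a small augmentation function that copies a span from later in a sequence and inserts it earlier, wrapped in special delimiters. This is a minimal sketch of the general idea only: the delimiter strings, the function name `insert_lookahead`, and the choice of where to insert and what to copy are assumptions, not details confirmed by the cited note.

```python
from typing import List


def insert_lookahead(tokens: List[str], insert_at: int, future_start: int,
                     future_len: int, open_tok: str = "<T>",
                     close_tok: str = "</T>") -> List[str]:
    """Return a copy of `tokens` with a delimited copy of a future span
    inserted at position `insert_at`. The original sequence is unchanged."""
    if not (0 <= insert_at <= future_start < len(tokens)):
        raise ValueError("need 0 <= insert_at <= future_start < len(tokens)")
    future = tokens[future_start:future_start + future_len]
    return tokens[:insert_at] + [open_tok] + future + [close_tok] + tokens[insert_at:]


example = ["go", "north", "then", "east", "to", "reach", "the", "goal"]
augmented = insert_lookahead(example, insert_at=1, future_start=5, future_len=3)
# augmented now carries "reach the goal" up front, inside the delimiters,
# followed by the untouched remainder of the original sequence.
```

Since the change lives entirely in the training sequences, the model, the loss, and the grounded action format can stay as they are, which is why this counts as a light-touch way to add planning signal.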
The quiet thread under all of this is plasticity. Low KL drift from the base model preserves a model's ability to keep learning across task changes, while parameter-only RL stalls when the domain shifts (see: Does staying close to the base model preserve learning ability?), and that preserved plasticity is what lets a *second* (planning) stage land at all. The cautionary mirror image is that overly hard samples make models learn degenerate shortcuts that contaminate skills they already had (see: Do overly hard RLVR samples actually harm model capabilities?). The takeaway a curious reader might not expect: 'don't interfere with grounding' isn't mainly about freezing grounding. It's about keeping the model plastic and staying off difficulty that backfires, so the strong-planning stage builds on a base it hasn't quietly corroded.
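Both halves of this paragraph have simple numerical counterparts. The sketch below shows a KL divergence that could be used to monitor drift from the base model, the group-relative normalization under which a single accidental success receives an outsized advantage, and a pass-rate filter for dropping nearly impossible prompts. It is illustrative only: the thresholds `low` and `high` and the function names are hypothetical, and population standard deviation is an assumed normalization choice.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def keep_prompt(pass_rate: float, low: float = 0.1, high: float = 0.9) -> bool:
    """Keep prompts the model sometimes, but not always, solves (hypothetical band)."""
    return low <= pass_rate <= high


rare = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])       # 1 success in 8
balanced = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])   # 4 successes in 8
# rare[0] is much larger than balanced[0]: the lone, possibly accidental,
# success is the trajectory that gets reinforced hardest.
```

With one success in eight samples, the successful trajectory's advantage is roughly the square root of seven, versus about one when half the group succeeds. That gap is the mechanism by which lucky shortcuts on near-impossible problems get amplified, and a pass-rate filter is one hedge against it.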
Sources (7 notes)
AutoGLM's research shows planning and grounding have opposing optimization requirements that pull against each other when bundled in one policy. An intermediate interface that separates them lets each capability be developed and optimized independently while still composing into a complete agent.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Omni-Thinker shows that structured domains decrease output entropy while creative domains increase it. Backward-transfer (BWT) guided scheduling, which trains structured tasks first, yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive tasks this reduces the hallucination and error propagation seen in pure chain-of-thought, and on interactive decision-making tasks it outperforms imitation and reinforcement learning baselines by 10-34% absolute success rate.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.