What distinguishes planning knowledge from an executable plan that works?
This explores the gap between an LLM knowing *how* a plan should go (the right steps, in the right vocabulary) and producing one that actually runs without breaking — and what the corpus says causes that gap.
The corpus is unusually direct here: only about 12% of GPT-4's generated plans are executable without errors, even though the model is fluent at describing what a plan should contain Can large language models actually create executable plans?. The distinction isn't about missing knowledge. It's about *assembly*: handling how subgoals collide, how resources get consumed, and how one step constrains the next. Planning knowledge is naming the ingredients; an executable plan is the reasoning that holds them together under real constraints.
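To make "assembly" concrete, here is a minimal sketch of an executability check. It is a toy, not the evaluation used in the cited work: the `Step` type, the `first_failure` helper, and the baking example are all hypothetical names chosen for illustration. It simulates a plan step by step and reports the first step whose resource needs cannot be met.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One plan step (hypothetical structure, for illustration only)."""
    name: str
    needs: dict = field(default_factory=dict)     # resources the step consumes
    produces: dict = field(default_factory=dict)  # resources the step makes available


def first_failure(plan, inventory):
    """Simulate the plan in order; return the first step that cannot run, or None."""
    stock = dict(inventory)  # copy so the caller's inventory is left untouched
    for step in plan:
        # A step is runnable only if every resource it needs is in stock.
        for item, qty in step.needs.items():
            if stock.get(item, 0) < qty:
                return step.name
        # Consume inputs, then add outputs, so later steps see the updated state.
        for item, qty in step.needs.items():
            stock[item] = stock.get(item, 0) - qty
        for item, qty in step.produces.items():
            stock[item] = stock.get(item, 0) + qty
    return None


# Every step is individually sensible, but two subgoals compete for the same eggs.
plan = [
    Step("make_batter", needs={"flour": 2, "egg": 2}, produces={"batter": 1}),
    Step("make_glaze", needs={"egg": 1}, produces={"glaze": 1}),
    Step("bake", needs={"batter": 1}, produces={"cake": 1}),
    Step("finish", needs={"cake": 1, "glaze": 1}, produces={"glazed_cake": 1}),
]

print(first_failure(plan, {"flour": 2, "egg": 2}))  # make_glaze
print(first_failure(plan, {"flour": 2, "egg": 3}))  # None
```

The plan names the right steps in a plausible order, yet it only runs when the inventory covers the interaction between the batter and glaze subgoals. That interaction is the part a fluent description can skip.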
The sharpest framing of this divide comes from work on "comprehension without competence": models articulate correct principles at 87% accuracy but apply them correctly only 64% of the time — a split the authors call a kind of computational split-brain, where the pathway that *explains* and the pathway that *executes* are dissociated rather than one being a deficit of the other Can language models understand without actually executing correctly?. That reframes the whole question: the failure isn't ignorance, it's a structural disconnect between knowing and doing. It rhymes with the broader finding that LLMs track statistical regularities of language without genuine epistemic competence — they capture the *form* of knowledge with measurable, recurring gaps where real knowing would be What do language models actually know?. You can even see the form/substance split in reasoning itself: logically *invalid* chain-of-thought exemplars perform nearly as well as valid ones, because the model is learning the shape of reasoning, not the inference Does logical validity actually drive chain-of-thought gains?.
The most interesting move in the corpus is what to do about it — and the answer is mostly architectural, not "make the model smarter." Several independent lines converge on *separating the planner from the executor.* Splitting a decomposer model from a solver model improves accuracy, and notably the decomposition skill transfers across domains while solving does not — evidence they're genuinely different capabilities that interfere when bundled together Does separating planning from execution improve reasoning accuracy?. GUI agents reach the same conclusion from a different direction: planning and grounding have *opposing* optimization requirements and pull against each other inside one policy, so the fix is an intermediate interface that lets each be tuned independently Why do planning and grounding pull against each other in agents? How should agents split planning from visual grounding?. And LLM Programs wrap the model in explicit algorithmic control flow, feeding each call only the context it needs for that step — treating execution as a debuggable scaffold around the model rather than something the model must hold in its head Can algorithms control LLM reasoning better than LLMs alone?.
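The planner/executor split and the LLM Programs idea can be sketched together in a few lines. This is an illustration under stated assumptions, not any cited system's implementation: the `LLM` callable type, `decompose`, `solve`, and the two stub models are hypothetical, and the stubs return canned text so the example runs without a real model.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def decompose(question: str, planner: LLM) -> list[str]:
    """Ask the planner for subquestions, one per line."""
    reply = planner(f"Break this into subquestions, one per line:\n{question}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def solve(subquestions: list[str], solver: LLM) -> list[str]:
    """Answer each subquestion, showing the solver only that step and prior answers."""
    answers: list[str] = []
    for sub in subquestions:
        context = "\n".join(answers)  # state lives in the program, not in the model
        answers.append(solver(f"Known so far:\n{context}\nAnswer: {sub}"))
    return answers


def stub_planner(prompt: str) -> str:
    # Stand-in for a decomposer model; returns a fixed decomposition.
    return "What is 6 * 7?\nWhat is that result plus 8?"


def stub_solver(prompt: str) -> str:
    # Stand-in for a solver model; looks up a canned answer for the subquestion.
    canned = {"What is 6 * 7?": "42", "What is that result plus 8?": "50"}
    return canned[prompt.rsplit("Answer: ", 1)[1]]


answers = solve(decompose("What is 6 * 7 + 8?", stub_planner), stub_solver)
print(answers)  # ['42', '50']
```

Because the loop and the accumulated answers sit in ordinary code, either model can be swapped or inspected independently, and a failure can be traced to a specific call rather than to one long generation.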
There's also a quieter set of findings about *why execution wanders* even when a path is available. Reasoning models fail less from lack of compute than from structural disorganization — they explore invalid branches and abandon promising paths prematurely, "like tourists, not scientists"; the viable plan was reachable but got dropped mid-stride Why do reasoning models abandon promising solution paths?. Two cheap interventions push against this without retraining: backward planning, which exploits goal-side bottlenecks so constraints bite earlier in the search Does planning direction affect how hard problems become?, and lookahead tokens baked into training data, which let a model condition on future information using standard infrastructure Can embedding future information in training data improve planning?.
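The intuition behind backward planning can be shown with a toy search problem. This is only a sketch of the bottleneck effect on a hand-built graph, not the cited method, which combines forward and backward planning with verification; the graph and the helper names `bfs_expansions` and `reverse` are assumptions made for the example.

```python
from collections import deque


def bfs_expansions(edges, start, goal):
    """Breadth-first search; return how many nodes were expanded before reaching goal."""
    seen, queue, expanded = {start}, deque([start]), 0
    while queue:
        node = queue.popleft()
        if node == goal:
            return expanded
        expanded += 1
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None  # goal unreachable


def reverse(edges):
    """Flip every edge so search can run from the goal toward the start."""
    flipped = {}
    for src, dsts in edges.items():
        for dst in dsts:
            flipped.setdefault(dst, []).append(src)
    return flipped


# Toy graph: the start fans out to 50 dead ends, but only "key" leads through
# the single "gate" that sits next to the goal.
edges = {
    "start": [f"dead_end_{i}" for i in range(50)] + ["key"],
    "key": ["gate"],
    "gate": ["goal"],
}

forward = bfs_expansions(edges, "start", "goal")
backward = bfs_expansions(reverse(edges), "goal", "start")
print(forward, backward)  # 53 3
```

Searching forward pays for every dead end before it finds the gate. Searching backward meets the gate constraint on the first step, so the same path is found after expanding three nodes.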
The thing you might not have expected to learn: the corpus suggests planning *knowledge* is the transferable, generalizable part — procedural knowledge drawn broadly from pretraining is what drives reasoning that ports across tasks Does procedural knowledge drive reasoning more than factual retrieval? — while reliable *execution* is the brittle, domain-specific, easily-disorganized part. So the answer to "what distinguishes them" is almost inverted from intuition: the abstract knowing travels well, and it's the concrete doing that keeps falling apart.
Sources (12 notes)
Only 12% of GPT-4 generated plans are actually executable without errors. LLMs excel at acquiring planning knowledge but fail at the reasoning assembly required to handle subgoal and resource interactions.
Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.
LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
AutoGLM's research shows planning and grounding have opposing optimization requirements that pull against each other when bundled in one policy. An intermediate interface that separates them lets each capability be developed and optimized independently while still composing into a complete agent.
Multiple independent systems (Agent S, AutoGLM, OmniParser) converged on factoring agent reasoning into a planning layer and a grounding layer, with a language-centric Agent-Computer Interface mediating between them due to their opposing optimization requirements.
LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Problems with bottlenecks near the goal become easier to solve by planning backward, because constraints appear earlier in the backward chain. Combined forward and backward planning with verification improved success by 4–24% across domains.
TRELAWNEY augments training data with special tokens encapsulating future information, allowing models to learn goal-conditioned generation using standard infrastructure. Results show improved planning, algorithmic reasoning, and story generation without modifying architecture or training procedures.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.