Can large language models actually create executable plans?
Do LLMs genuinely assemble plans that work, or just generate planning-domain knowledge that sounds coherent? Understanding this distinction matters for deploying AI in real planning tasks.
Solving planning tasks requires two distinct capabilities: (a) having planning domain knowledge (actions, preconditions, effects, hierarchical recipes, past cases) and (b) assembling that knowledge into an executable plan that handles subgoal and resource interactions. LLMs are strong at (a) and fail at (b): only about 12% of the plans GPT-4 generates are executable without errors and reach the goal.
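To make "executable and goal-reaching" concrete, here is a minimal sketch of a plan validator for a STRIPS-style domain. The `Action` class, the `validate_plan` function, and the two-block toy domain are illustrative assumptions and are not taken from the cited papers. The point is that a plan can be built entirely from valid domain knowledge and still fail when it is simulated step by step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action: facts required, added, and deleted."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Simulate the plan from the initial state; return (ok, reason)."""
    state = set(initial_state)
    for step, action in enumerate(plan):
        missing = action.preconditions - state
        if missing:
            return False, f"step {step} ({action.name}): unmet preconditions {sorted(missing)}"
        state = (state - action.del_effects) | action.add_effects
    unmet = set(goal) - state
    if unmet:
        return False, f"goal not reached: missing {sorted(unmet)}"
    return True, "plan is executable and reaches the goal"


# Toy two-block domain (assumed for illustration only).
pickup_a = Action(
    name="pickup_a",
    preconditions=frozenset({"hand_empty", "clear_a", "on_table_a"}),
    add_effects=frozenset({"holding_a"}),
    del_effects=frozenset({"hand_empty", "clear_a", "on_table_a"}),
)
stack_a_on_b = Action(
    name="stack_a_on_b",
    preconditions=frozenset({"holding_a", "clear_b"}),
    add_effects=frozenset({"on_a_b", "hand_empty", "clear_a"}),
    del_effects=frozenset({"holding_a", "clear_b"}),
)

initial = {"hand_empty", "clear_a", "clear_b", "on_table_a", "on_table_b"}
goal = {"on_a_b"}

good_plan = [pickup_a, stack_a_on_b]   # respects the precondition ordering
bad_plan = [stack_a_on_b]              # uses a valid action, but too early

good_result = validate_plan(initial, goal, good_plan)
bad_result = validate_plan(initial, goal, bad_plan)
```

Both plans use only actions that exist in the domain, which corresponds to capability (a). Only the first survives simulation, which corresponds to capability (b).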
The confusion between these two capabilities explains much of the conflicting literature on LLM planning. "Many papers claiming planning abilities of LLMs, on closer examination, wind up confusing general planning knowledge extracted from the LLMs for executable plans. When all we are looking for are abstract plans, such as 'wedding plans,' with no intention of actually executing said plans directly, it is easy to confuse them for complete executable plans."
Self-critiquing makes things worse, not better. LLMs "hallucinate both false positives and false negatives while verifying the solutions they generate." With self-verification, performance actually diminishes compared to systems with external sound verifiers. The nature of feedback, whether binary or detailed, shows minimal impact on generation, suggesting "the core issue lies in the LLM's binary verification capabilities rather than the granularity of feedback." This connects to the related note "Does self-revision actually improve reasoning in language models?": the self-critiquing failure in planning is the same mechanism operating on a different task type.
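The two error types can be counted by comparing a self-verifier's verdicts against a sound external verifier. The sketch below assumes a hypothetical `verification_errors` helper and uses made-up verdict pairs purely for illustration; none of the numbers come from the cited papers. A false positive accepts a broken plan, so the loop stops with an invalid answer. A false negative rejects a working plan, so a correct answer is discarded.

```python
def verification_errors(verdicts):
    """Count disagreements between a self-verifier and a sound verifier.

    verdicts: list of (llm_says_valid, sound_says_valid) boolean pairs.
    """
    false_positives = sum(1 for llm, truth in verdicts if llm and not truth)
    false_negatives = sum(1 for llm, truth in verdicts if not llm and truth)
    return {
        "false_positives": false_positives,
        "false_negatives": false_negatives,
        "total": len(verdicts),
    }


# Made-up verdict pairs (illustrative only, not experimental data).
toy_verdicts = [
    (True, True),    # both accept: correct
    (True, False),   # self-verifier accepts an invalid plan: false positive
    (False, True),   # self-verifier rejects a valid plan: false negative
    (False, False),  # both reject: correct
    (True, False),   # another false positive
]

errors = verification_errors(toy_verdicts)
```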
The proposed architecture is the LLM-Modulo framework: a generate-test-critique loop where LLMs generate candidate plans and a bank of external critics evaluates them. LLMs play multiple roles — guessing candidates, translating formats, helping users flesh out specifications, helping experts acquire domain models — but are never ascribed planning or verification abilities. Plans produced by this compound system have formal soundness guarantees because of the external critics.
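The generate-test-critique loop can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework's actual implementation: `llm_modulo`, `make_stub_generator`, and both critics are hypothetical names, and the generator is a stub that replays canned candidates in place of a real LLM call.

```python
from typing import Callable, List, Optional, Sequence

Plan = List[str]
# A critic returns a list of complaints; an empty list means "no objection".
Critic = Callable[[Plan], List[str]]


def llm_modulo(
    generate: Callable[[List[str]], Plan],
    critics: Sequence[Critic],
    max_rounds: int = 5,
) -> Optional[Plan]:
    """Return the first candidate that every critic accepts, else None."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        candidate = generate(feedback)  # the generator only guesses
        feedback = [msg for critic in critics for msg in critic(candidate)]
        if not feedback:                # all external critics are satisfied
            return candidate
    return None


def make_stub_generator(canned: List[Plan]) -> Callable[[List[str]], Plan]:
    """Stand-in for an LLM: replays canned candidates and ignores feedback."""
    remaining = list(canned)

    def generate(feedback: List[str]) -> Plan:
        return remaining.pop(0)

    return generate


def ordering_critic(plan: Plan) -> List[str]:
    """Hypothetical critic: stacking a on b requires picking up a first."""
    if "stack_a_on_b" in plan and (
        "pickup_a" not in plan
        or plan.index("pickup_a") > plan.index("stack_a_on_b")
    ):
        return ["pickup_a must come before stack_a_on_b"]
    return []


def goal_critic(plan: Plan) -> List[str]:
    """Hypothetical critic: the goal needs a stacked on b."""
    return [] if "stack_a_on_b" in plan else ["goal on(a, b) is never achieved"]


critics = [ordering_critic, goal_critic]
generator = make_stub_generator(
    [["stack_a_on_b"], ["pickup_a"], ["pickup_a", "stack_a_on_b"]]
)
accepted = llm_modulo(generator, critics)
```

In this sketch, any plan the loop returns has passed every critic, so the guarantee is exactly as strong as the critics are sound. The generator contributes candidates and never a verdict.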
Kambhampati's framing is precise: "LLMs are amazing giant external non-veridical memories that can serve as powerful cognitive orthotics for human or machine agents, if rightly used." The "non-veridical" is key — LLMs reconstruct completions probabilistically rather than indexing and retrieving exactly. "The boon ('creativity') and bane ('hallucination') of LLMs is that n-gram models will naturally mix and match."
This connects to the related note "Can language models understand without actually executing correctly?": the planning finding is a specific instance, in which LLMs comprehend planning domains (extract valid action descriptions, preconditions, effects) without being competent to execute plans (sequence actions that handle interactions and constraints). It also connects to "Why do language models fail to act on their own reasoning?": planning adds a third data point to the knowing-doing gap, with 87% correct rationales in sequential decisions, 64% correct actions, and now 12% executable plans. The gap widens as task complexity increases.
Papers: "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks"; "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?"; "Can Large Language Models Reason and Plan?"
Inquiring lines that use this note as a source (12)
- Why do LLMs generate ideas that sound novel but fail during execution?
- What distinguishes planning knowledge from an executable plan that works?
- What explains the 87 percent to 12 percent cliff in plan executability?
- What components must wrap an LLM to build a working CRS?
- What interference occurs when planning and synthesis happen in the same component?
- Why do LLMs understand efficient language but fail to produce it?
- Does LLM reasoning always match the outputs it generates?
- What planning tasks benefit most from combining LLM generation with external verification?
- Can the LLM-Modulo framework extend solver integration to domain planning?
- Which LLM backends produce the most executable research ideas?
- Do LLMs need world models to make accurate predictions?
- How does this differ from using LLMs as the policy itself?
Related concepts in this collection (5)
- Can language models understand without actually executing correctly?
  Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.
  planning is the paradigmatic case: comprehension of domain without competence to execute
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  planning extends the gap: 87% → 64% → 12% as complexity increases
- Does self-revision actually improve reasoning in language models?
  When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
  self-critiquing failure generalizes from reasoning to planning
- Can symbolic solvers fix how LLMs reason about logic?
  LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers, using solver feedback to refine attempts, overcome this fundamental weakness?
  LLM-Modulo is the planning-domain instantiation of the same principle
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  planning heuristics without world models explains why knowledge extraction works but plan assembly fails
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Can Large Language Models Reason and Plan?
- Position: LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
- On the Roles of LLMs in Planning: Embedding LLMs into Planning Graphs
- Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1
- Chain of Thoughtlessness? An Analysis of CoT in Planning
- Can Large Language Models Really Improve by Self-critiquing Their Own Plans?
- Branch-Solve-Merge Improves Large Language Model Evaluation and Generation
- Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations
Original note title
LLMs confuse planning knowledge for executable plans — only 12 percent of GPT-4 generated plans are executable and self-critiquing worsens performance