SYNTHESIS NOTE

Can large language models actually create executable plans?

Do LLMs genuinely assemble plans that work, or just generate planning-domain knowledge that sounds coherent? Understanding this distinction matters for deploying AI in real planning tasks.

Synthesis note · 2026-03-30 · sourced from Tasks Planning

Solving planning tasks requires two distinct capabilities: (a) having planning domain knowledge (actions, preconditions, effects, hierarchical recipes, past cases) and (b) assembling that knowledge into an executable plan that handles subgoal and resource interactions. LLMs are strong at (a) and fail at (b). Only about 12% of the plans GPT-4 generates are executable without errors and actually reach their goals.
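To make "executable and goal-reaching" concrete, here is a minimal sketch of a STRIPS-style plan check. The two-block domain, the `Action` type, and the `validate_plan` helper are illustrative assumptions and are not taken from the cited papers. A plan counts as executable if each action's preconditions hold in the state where it is applied, and as goal-reaching if the final state contains every goal fact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    # Hypothetical STRIPS-style action: facts are plain strings.
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset


def validate_plan(initial_state, goal, plan):
    """Return (ok, reason). ok is True only if every action is applicable
    in sequence and the final state contains all goal facts."""
    state = set(initial_state)
    for step, action in enumerate(plan, start=1):
        missing = action.preconditions - state
        if missing:
            return False, f"step {step} ({action.name}): unmet {sorted(missing)}"
        state = (state - action.del_effects) | action.add_effects
    unmet_goal = set(goal) - state
    if unmet_goal:
        return False, f"goal not reached: missing {sorted(unmet_goal)}"
    return True, "plan is executable and reaches the goal"


fs = frozenset
unstack_A_B = Action("unstack_A_B",
                     fs({"on(A,B)", "clear(A)", "handempty"}),
                     fs({"holding(A)", "clear(B)"}),
                     fs({"on(A,B)", "clear(A)", "handempty"}))
putdown_A = Action("putdown_A",
                   fs({"holding(A)"}),
                   fs({"ontable(A)", "clear(A)", "handempty"}),
                   fs({"holding(A)"}))
pickup_B = Action("pickup_B",
                  fs({"ontable(B)", "clear(B)", "handempty"}),
                  fs({"holding(B)"}),
                  fs({"ontable(B)", "clear(B)", "handempty"}))
stack_B_A = Action("stack_B_A",
                   fs({"holding(B)", "clear(A)"}),
                   fs({"on(B,A)", "clear(B)", "handempty"}),
                   fs({"holding(B)", "clear(A)"}))

# Toy problem: A sits on B; the goal is to have B on A.
INIT = {"on(A,B)", "ontable(B)", "clear(A)", "handempty"}
GOAL = {"on(B,A)"}

# Sounds reasonable ("pick up B, stack it on A") but ignores that A is on top of B.
plausible_plan = [pickup_B, stack_B_A]
# Handles the interaction: A must be moved out of the way first.
working_plan = [unstack_A_B, putdown_A, pickup_B, stack_B_A]

print(validate_plan(INIT, GOAL, plausible_plan))
print(validate_plan(INIT, GOAL, working_plan))
```

The first plan uses valid action names and a sensible-sounding order, which is the kind of domain knowledge in (a), yet it fails at its first step because a precondition does not hold. Only the second plan satisfies (b).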

The confusion between these two capabilities explains much of the conflicting literature on LLM planning. "Many papers claiming planning abilities of LLMs, on closer examination, wind up confusing general planning knowledge extracted from the LLMs for executable plans. When all we are looking for are abstract plans, such as 'wedding plans,' with no intention of actually executing said plans directly, it is easy to confuse them for complete executable plans."

Self-critiquing makes things worse, not better. LLMs "hallucinate both false positives and false negatives while verifying the solutions they generate." Performance with self-verification is lower than in systems that use an external sound verifier. Whether the feedback is binary or detailed has minimal impact on generation, suggesting "the core issue lies in the LLM's binary verification capabilities rather than the granularity of feedback." As discussed in the related note "Does self-revision actually improve reasoning in language models?", the self-critiquing failure in planning is the same mechanism operating on a different task type.

The proposed architecture is the LLM-Modulo framework: a generate-test-critique loop where LLMs generate candidate plans and a bank of external critics evaluates them. LLMs play multiple roles — guessing candidates, translating formats, helping users flesh out specifications, helping experts acquire domain models — but are never ascribed planning or verification abilities. Plans produced by this compound system have formal soundness guarantees because of the external critics.
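The generate-test-critique loop can be sketched as follows. This is a toy illustration under stated assumptions, not the framework's actual implementation: `stub_llm` is a hypothetical stand-in for an LLM call that returns canned candidates, and the two critics are simplified placeholders for the bank of sound external verifiers the framework relies on. All names here are invented for the example.

```python
from typing import Callable, List, Optional, Tuple

Plan = List[str]
# A critic returns a list of critiques; an empty list means "no objection".
Critic = Callable[[Plan], List[str]]


def precedence_critic(plan: Plan) -> List[str]:
    """Toy hard critic: each step must appear after the step it depends on."""
    requires = {
        "putdown_A": "unstack_A_B",
        "pickup_B": "putdown_A",
        "stack_B_A": "pickup_B",
    }
    issues = []
    for i, step in enumerate(plan):
        needed = requires.get(step)
        if needed is not None and needed not in plan[:i]:
            issues.append(f"'{step}' needs '{needed}' earlier in the plan")
    return issues


def goal_critic(plan: Plan) -> List[str]:
    """Toy goal critic: the plan must end with the goal-achieving step."""
    if plan and plan[-1] == "stack_B_A":
        return []
    return ["plan does not end by achieving on(B,A)"]


def stub_llm(feedback_history: List[List[str]]) -> Plan:
    """Hypothetical generator. A real system would prompt an LLM with the
    accumulated critiques; here we return canned guesses, one per round."""
    candidates = [
        ["pickup_B", "stack_B_A"],
        ["unstack_A_B", "pickup_B", "stack_B_A"],
        ["unstack_A_B", "putdown_A", "pickup_B", "stack_B_A"],
    ]
    return candidates[min(len(feedback_history), len(candidates) - 1)]


def llm_modulo(generate: Callable[[List[List[str]]], Plan],
               critics: List[Critic],
               max_rounds: int = 5) -> Tuple[Optional[Plan], int]:
    """Loop until every critic accepts a candidate or the budget runs out.
    The generator only guesses; acceptance is decided by the critics alone."""
    history: List[List[str]] = []
    for round_no in range(1, max_rounds + 1):
        candidate = generate(history)
        critiques = [c for critic in critics for c in critic(candidate)]
        if not critiques:
            return candidate, round_no
        history.append(critiques)  # fed back to the generator next round
    return None, max_rounds


plan, rounds = llm_modulo(stub_llm, [precedence_critic, goal_critic])
print(plan, rounds)
```

The design point the sketch tries to show is that any guarantee on the returned plan comes from the critics: the generator is never trusted to verify its own output, and if no candidate passes within the budget the loop returns nothing rather than an unverified plan.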

Kambhampati's framing is precise: "LLMs are amazing giant external non-veridical memories that can serve as powerful cognitive orthotics for human or machine agents, if rightly used." The "non-veridical" is key — LLMs reconstruct completions probabilistically rather than indexing and retrieving exactly. "The boon ('creativity') and bane ('hallucination') of LLMs is that n-gram models will naturally mix and match."

The related note "Can language models understand without actually executing correctly?" describes the general pattern, and the planning finding is a specific instance of it: LLMs comprehend planning domains (they extract valid action descriptions, preconditions, and effects) without being competent to execute plans (sequencing actions so that interactions and constraints are handled). Relative to the note "Why do language models fail to act on their own reasoning?", planning adds a third data point to the knowing-doing gap: 87% correct rationales in sequential decisions, 64% correct actions, and now 12% executable plans. The gap widens as task complexity increases.

Papers: "LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks"; "Can Large Language Models Really Improve by Self-critiquing Their Own Plans?"; "Can Large Language Models Reason and Plan?"

Inquiring lines that read this note 12

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do LLM research ideas score high on novelty yet collapse into low diversity?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How can LLM recommenders match or exceed collaborative filtering performance?

What components must wrap an LLM to build a working CRS?

How should planning and perception grounding be factored in agent design?

What interference occurs when planning and synthesis happen in the same component?

Do language models learn genuine linguistic structure or just surface patterns?

Why do LLMs understand efficient language but fail to produce it?

How effectively do deterministic tools improve language model reasoning on formal tasks?

Do language models develop causal world models or rely on statistical patterns?

Do LLMs need world models to make accurate predictions?

How should we design LLM systems to maintain alignment and control?

How does this differ from using LLMs as the policy itself?

Related concepts in this collection 5

Can language models understand without actually executing correctly? Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.
planning is the paradigmatic case: comprehension of domain without competence to execute

Why do language models fail to act on their own reasoning? LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
planning extends the gap: 87% → 64% → 12% as complexity increases

Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
self-critiquing failure generalizes from reasoning to planning

Can symbolic solvers fix how LLMs reason about logic? LLMs excel at understanding natural language but fail at precise logical inference. Can pairing them with deterministic symbolic solvers, using solver feedback to refine attempts, overcome this fundamental weakness?
LLM-Modulo is the planning-domain instantiation of the same principle

Do foundation models learn world models or task-specific shortcuts? When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
planning heuristics without world models explains why knowledge extraction works but plan assembly fails
