INQUIRING LINE

Why do LLM agents make promises without executing them?

This explores the gap between an LLM agent saying it will do something and actually doing it — why agents announce, agree, or plan an action and then never carry it out.


The corpus frames this gap not as laziness or a bug but as a structural split between two pathways that we tend to assume are one. The sharpest account is what one line calls computational split-brain syndrome: models articulate the correct principle far more reliably than they execute it, with roughly 87% accuracy when explaining versus 64% when acting (see: Can language models understand without actually executing correctly?). The promise lives in the explanation pathway; the follow-through lives in a different, weaker one. This shows up again as 'Potemkin understanding', a fluent, correct-sounding account sitting on top of an inability to apply it (see: How do LLMs fail to know what they seem to understand?). So a promise is cheap precisely because generating the words that constitute a promise is the thing LLMs are best at.

There's a second, more mundane mechanism worth knowing about: agents are trained to respond, not to pursue. Conversational agents are structurally passive. They can't initiate, hold a goal across turns, or drive toward an outcome, because alignment optimizes for answering the query in front of them rather than executing a plan an agent set for itself (see: Why can't conversational AI agents take the initiative?). A promise made in turn three is just text; nothing in the architecture carries it forward as a live commitment. The multi-agent literature names this directly as a recurring failure mode, 'flake replies', where an agent agrees to a task and then doesn't perform it, and traces it to the absence of persistent goal representation and stable role identity (see: Why do autonomous LLM agents fail in predictable ways?).
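One rough way to make the 'flake reply' pattern concrete is to scan a transcript for turns that announce an action but are never followed by a tool call. The sketch below is illustrative only: the `Turn` record, the `PROMISE_CUES` phrase list, and the `window` parameter are all assumptions made for this example, not anything defined in the cited notes, and simple phrase matching would miss many real promises.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One agent turn: the text it produced and the tool calls it issued."""
    text: str
    tool_calls: list[str] = field(default_factory=list)


# Hypothetical cue phrases; a real detector would need something sturdier.
PROMISE_CUES = ("i will", "i'll", "let me", "going to")


def find_flake_replies(turns: list[Turn], window: int = 2) -> list[int]:
    """Return indices of turns that promise an action but see no tool call
    in that turn or in the next `window` turns."""
    flaky = []
    for i, turn in enumerate(turns):
        lowered = turn.text.lower()
        if not any(cue in lowered for cue in PROMISE_CUES):
            continue  # no promise language in this turn
        followup = turns[i : i + 1 + window]
        if not any(t.tool_calls for t in followup):
            flaky.append(i)
    return flaky


transcript = [
    Turn("I'll run the tests now."),
    Turn("Here is a summary."),
    Turn("Anything else?"),
    Turn("Let me fetch the file.", ["read_file"]),
]
flaky_turns = find_flake_replies(transcript)
```

Here only the first turn is flagged: it promises an action and nothing is executed within the window, whereas the last turn pairs its promise with an actual call.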

The interesting turn is what the corpus says *fixes* this, because it implies the cause. Reliability, it argues, doesn't come from a smarter model; it comes from externalizing memory, skills, and protocols into a harness layer outside the model, so the commitment is held by the system rather than re-derived from scratch each turn (see: Where does agent reliability actually come from?). In other words, a broken promise is what happens when the obligation only ever existed inside a single generation. Give it a place to live outside the token stream and it can be tracked, checked, and acted on.
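As a minimal sketch of what "a place to live outside the token stream" could look like, the code below keeps commitments in a small ledger owned by the harness rather than by the model. The names `Commitment`, `CommitmentLedger`, `record`, `fulfil`, and `reminder` are hypothetical, invented for this illustration; the cited note describes the idea of externalizing memory, not this particular design.

```python
from dataclasses import dataclass, field


@dataclass
class Commitment:
    description: str
    made_at_turn: int
    done: bool = False


@dataclass
class CommitmentLedger:
    """Holds obligations outside the model so they persist across turns."""
    items: list[Commitment] = field(default_factory=list)

    def record(self, description: str, turn: int) -> int:
        """Store a new commitment and return its id (its list index)."""
        self.items.append(Commitment(description, turn))
        return len(self.items) - 1

    def fulfil(self, commitment_id: int) -> None:
        """Mark a commitment as carried out."""
        self.items[commitment_id].done = True

    def open_items(self) -> list[Commitment]:
        return [c for c in self.items if not c.done]

    def reminder(self) -> str:
        """Text a harness could prepend to the next prompt; empty if none open."""
        pending = self.open_items()
        if not pending:
            return ""
        lines = [f"- (turn {c.made_at_turn}) {c.description}" for c in pending]
        return "Open commitments:\n" + "\n".join(lines)


ledger = CommitmentLedger()
report_id = ledger.record("email the report", turn=3)
tests_id = ledger.record("run tests", turn=4)
ledger.fulfil(report_id)
pending_text = ledger.reminder()
```

The design choice worth noting is that the ledger, not the model, decides what is still owed: the harness can check `open_items()` after every turn and re-inject the reminder until each entry is marked done.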

Laterally, two adjacent findings deepen the picture. Agents systematically lean on raw immediate context and discount condensed or retrieved experience, so a plan they 'remember' making is exactly the kind of summarized information they tend to ignore in favor of whatever's in front of them right now (see: Why do LLM agents ignore condensed experience summaries?). And reframing LLMs as policies in a multi-step, partially observable decision process, rather than as one-shot text generators, is what makes the follow-through itself optimizable: planning and memory become subsystems you can train, not emergent hopes (see: How does treating LLMs as multi-step agents change what we can optimize?).
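The contrast between the one-shot and multi-step views can be written down in standard reinforcement-learning notation. The symbols below ($\pi_\theta$, $r$, $\gamma$, $s_t$, $o_t$, $a_t$, $T$) follow common textbook conventions and are an assumption of this sketch; they are not necessarily the notation used in the survey itself.

```latex
% Single-step view: one prompt x, one response y, one reward.
\[
J_{\text{single}}(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[\, r(x, y) \,\big]
\]

% Multi-step POMDP view: hidden state s_t, observation o_t, action a_t.
\[
J_{\text{multi}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \Big[\, \sum_{t=0}^{T} \gamma^{t} \, r(s_t, a_t) \,\Big],
\qquad
a_t \sim \pi_\theta(\cdot \mid o_{\le t},\, a_{<t})
\]
```

Read this way, the single-step objective scores only the words of a reply, while the multi-step objective can assign reward to the later actions that fulfil it, which is one way to see why follow-through becomes something that can be trained.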

The thing you didn't know you wanted to know: the promise and the execution aren't a single capability that sometimes misfires — they're two different competences, and LLMs are lopsidedly good at the one that produces the words. Fixing it is less about trust and more about building scaffolding that remembers what was promised.


Sources (7 notes)

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Why can't conversational AI agents take the initiative?

Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.

Why do autonomous LLM agents fail in predictable ways?

Research identifies role flipping, flake replies, infinite loops, and conversation deviation as LLM-specific failures in multi-agent cooperation. These occur because LLMs lack persistent goal representation and stable role identity.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Why do LLM agents ignore condensed experience summaries?

Across 10 LLMs and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.

How does treating LLMs as multi-step agents change what we can optimize?

The Agentic RL survey shows that modeling LLMs as policies in Partially Observable MDPs rather than single-step generators makes memory, planning, and reasoning into RL-optimizable subsystems. This structural reframing explains the recent empirical convergence across memory-based agents, skill learning, and strategy distillation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about why LLM agents promise without executing. A curated library (2023–2026) isolates a structural split: models articulate correct principles far more reliably than they act on them. Is this constraint still real, or have newer models, training methods, harnesses, or evals since dissolved it?

What a curated library found — and when (dated claims, not current truth):

• Comprehension without competence: 87% accuracy explaining a principle vs. 64% executing it (2025). The promise lives in explanation; follow-through lives in a weaker pathway.
• Conversational agents are structurally passive—they lack persistent goal representation and stable role identity across turns, so a promise made in turn 3 is orphaned text (2024–2025).
• Multi-agent systems show "flake replies": agents agree to tasks and don't perform them due to absent goal memory (2025).
• Reliability doesn't scale with model size; it comes from externalizing memory, skills, and protocols into a harness layer outside token generation (2026).
• Agents systematically discount retrieved/summarized experience, favoring raw immediate context—so "remembered" plans are exactly the kind of condensed info they ignore (2026).

Anchor papers (verify; mind their dates):

• arXiv:2507.10624 (2025) — Comprehension Without Competence
• arXiv:2508.13143 (2025) — Why Autonomous Agents Fail When Completing Tasks
• arXiv:2604.08224 (2026) — Externalization in LLM Agents: Memory, Skills, Protocols
• arXiv:2509.02547 (2025) — Agentic RL for LLMs Survey

Your task:

(1) RE-TEST EACH CONSTRAINT. For the 87%/64% split, the passivity diagnosis, and the "flake reply" failure mode: have newer models (o1, Claude 3.5, Grok-3, proprietary systems), post-training methods (DPO, constitutional RL), or orchestration layers (persistent memory systems, state machines, function-call verification) since relaxed these? Separate the durable question (why explain–execute divergence persists) from the perishable limitation (whether it's still a bottleneck for deployed agents).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: any paper showing agents DO reliably follow through, or reframing the split entirely?

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If harness externalization is the fix, what's the new frontier—adaptive harness synthesis? Goal-drift detection? (b) If newer models close the explain–execute gap, does the tension shift to a different promise–action mismatch (e.g., alignment-vs-capability)?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
