INQUIRING LINE

A model can reason its way to the right move — then take the wrong one anyway, consistently.

Can a model predict the right action but execute the wrong one?

This explores the gap between a model knowing the correct action and actually taking it — whether reasoning the right move and executing it are separate abilities that can come apart.

This line reads the question as being about the "knowing-doing gap": can a model work out the right move and then fail to make it? The corpus says yes, and surprisingly often. The clearest evidence comes from work showing that LLMs generate correct rationales about 87% of the time but act on that reasoning only 64% of the time (see "Why do language models fail to act on their own reasoning?"). The model isn't confused: it states the right plan and then defaults to a greedy, frequency-biased choice instead. That 23-point gap between knowing and doing persists across model sizes, which means scaling alone doesn't close it.
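To make the arithmetic behind the gap concrete, here is a minimal sketch of how the two rates could be tallied from a log of episodes. The `Episode` record, the `knowing_doing_gap` helper, and the synthetic log are all hypothetical; the log is constructed to mirror the 87% and 64% figures and is not real experimental data. It also assumes, for simplicity, that every correct action followed a correct rationale.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    rationale_correct: bool  # the stated reasoning names the right action
    action_correct: bool     # the executed action is the right action


def knowing_doing_gap(episodes):
    """Return (knowing %, doing %, gap in percentage points)."""
    n = len(episodes)
    knowing = 100 * sum(e.rationale_correct for e in episodes) / n
    doing = 100 * sum(e.action_correct for e in episodes) / n
    return knowing, doing, knowing - doing


# Synthetic log built to mirror the cited rates (assumption, not real data).
episodes = (
    [Episode(True, True)] * 64      # knew the move and made it
    + [Episode(True, False)] * 23   # knew the move, executed something else
    + [Episode(False, False)] * 13  # never reasoned to the right move
)

knowing, doing, gap = knowing_doing_gap(episodes)
print(f"knowing {knowing:.0f}%, doing {doing:.0f}%, gap {gap:.0f} points")
```

The middle group is the one this line of inquiry cares about: episodes where the rationale was right and the action was not.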

Why would prediction and action diverge? One answer is that being accurate "on average" is not the same as being right where it counts. A model can fit data well overall yet systematically mispredict in exactly the decision-critical states that determine the outcome (see "Why do accurate predictions lead to poor decisions?"). So even a model with the right general picture can execute the wrong action at the moments that matter most; accuracy and good decisions are formally distinct properties.
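A toy illustration of that distinction, under stated assumptions: ten states, a policy that acts whenever the predicted value is positive, and two hypothetical predictors. The numbers are invented for the example and do not come from the cited work. Predictor A has the lower average error but gets the sign wrong in the one state where the sign decides the outcome.

```python
# Ten states: nine routine (acting pays 5) and one critical (acting costs 40).
true_values = [5.0] * 9 + [-40.0]

# Hypothetical predictor A: exact on routine states, wrong sign on the critical one.
model_a = [5.0] * 9 + [1.0]
# Hypothetical predictor B: sloppy on routine states, exact on the critical one.
model_b = [20.0] * 9 + [-40.0]


def mean_abs_error(pred, truth):
    """Average absolute prediction error across all states."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def realized_payoff(pred, truth):
    """Policy: act iff the predicted value is positive.
    Acting earns the true value; passing earns 0."""
    return sum(t for p, t in zip(pred, truth) if p > 0)


for name, pred in [("A", model_a), ("B", model_b)]:
    print(name, mean_abs_error(pred, true_values), realized_payoff(pred, true_values))
```

Under these assumptions A wins on average error and loses on payoff, because B's errors never cross the decision boundary and A's single error does.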

The failure also compounds once the model is acting in a loop. When a model's own earlier mistakes fill its context, performance degrades non-linearly: it starts conditioning on its own errors and digging deeper (see "Do models fail worse when their own errors fill the context?"). This matters because post-training pushes models from passive prediction toward treating their outputs as actions that shape future inputs (see "Do models recognize their own outputs as actions shaping future inputs?"), so a single wrong execution doesn't just cost one step; it contaminates everything downstream. There's also a directional bias baked into how models update: they're optimistic about actions they chose and pessimistic about the roads not taken (see "Do language models learn differently from good versus bad outcomes?"), which can lock in a wrong action even when the better one was knowable.

The useful twist is that some of these gaps are trainable rather than fundamental. Reinforcement learning can narrow the knowing-doing gap directly (see "Why do language models fail to act on their own reasoning?"), and there's a related lesson about *how* you reward: binary correct/incorrect rewards push models toward confident execution of wrong answers because they never penalize confident mistakes. Adding a calibration term fixes this without sacrificing accuracy (see "Does binary reward training hurt model calibration?"). So the right-prediction-wrong-action problem is partly an artifact of training signals that reward acting decisively over acting correctly.
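The source note for the calibration result describes the added term as a Brier score. A minimal sketch of what such a reward could look like follows; the exact combination used in the cited work is not given here, so the form "correctness minus squared confidence error" and the function names are assumptions made for illustration.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: stated confidence plays no role."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence.
    Assumed form for illustration, not the cited paper's exact objective."""
    outcome = 1.0 if correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


# A wrong answer earns the same binary reward at any confidence level,
# while the calibrated reward punishes the confident mistake harder.
for confidence in (0.5, 0.95):
    print(confidence, binary_reward(False), calibrated_reward(False, confidence))
```

With the binary reward, a wrong answer stated at 95% confidence costs nothing extra, so guessing confidently is never discouraged. With the penalty term, the same mistake scores lower than a hesitant one, while a confident correct answer still scores highest.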

The thing worth walking away with: "knowing" and "doing" are genuinely separate capabilities in these systems, and the gap between them is a measurable, distinct failure mode, not just noise that a larger model would smooth away. A model that reasons perfectly can still be a bad agent, and fixing the reasoning won't automatically fix the acting.

Sources 6 notes

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Why do accurate predictions lead to poor decisions?

Research formalizes necessary and sufficient conditions for predictive models to support optimal decisions. A model can predict accurately on average yet systematically mispredict in decision-critical states.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about the knowing-doing gap in LLMs — whether models can predict correct actions but execute wrong ones. A curated library (arXiv, 2023–2026) found the gap is real, measurable, and partly trainable. Your job is to determine what still holds and what has shifted.

What a curated library found — and when (dated claims, not current truth):
- LLMs generate correct rationales 87% of the time but act on them only 64% of the time; this 23-point gap persists across model sizes (2024–2025).
- Predictive accuracy and decision optimality are formally distinct; models optimized for data fit systematically mispredict in decision-critical states (2025).
- Self-conditioning: when a model's own prior errors fill its context, performance degrades non-linearly, amplifying future mistakes (2025).
- Post-training shifts models from passive prediction to enaction; wrong execution contaminates downstream context (2026).
- In-context learning agents exhibit asymmetric belief updating — optimism bias for chosen actions, pessimism for foregone ones — locking in suboptimal decisions (2024).

Anchor papers (verify; mind their dates):
- arXiv:2402.03969 (2024): In-context learning agents and asymmetric belief updating.
- arXiv:2509.09677 (2025): Long-horizon execution and error accumulation in LLMs.
- arXiv:2605.25459 (2026): Post-training and enaction — models recognizing their own outputs as actions.
- arXiv:2506.01347 (2025): Negative reinforcement and reasoning robustness.

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 87%-to-64% gap, the self-conditioning effect, and asymmetric belief updating, determine whether newer models (o1, Claude 3.5, Grok variants), method advances (process reward models, tree search, in-context calibration), tooling (structured generation, state reset mechanisms), or multi-agent orchestration (critic loops, debate) have since relaxed or closed these gaps. Separate the durable question ("Do models systematically fail to execute correct plans?") from perishable limitations ("Does scaling alone fix this?"). Be explicit about what resolved each constraint, if anything.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — e.g., papers showing the gap closes under specific training regimes, or arguing it's an artifact of evaluation rather than a real failure mode.
(3) Propose 2 research questions that ASSUME the regime may have moved: one on whether the gap disappears under agentic scaffolding (e.g., external verifiers, rollback), and one on whether the gap is actually a feature — an optimal policy for uncertain or ambiguous settings.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
