INQUIRING LINE

How do LLM explanations diverge from actual internal reasoning?

This explores the gap between what an LLM says it's doing — its explanations, chain-of-thought, and self-reports — and what's actually happening inside it when it produces an answer.


This question is really about a gap: the words a model uses to explain itself are produced by a different process than the one that produces its answers, and the corpus suggests these two processes are only loosely coupled. The clearest evidence comes from work arguing that LLM reasoning mostly lives in hidden-state trajectories, with the visible chain-of-thought serving as only a "partial interface" onto that hidden process rather than a transcript of it (Where does LLM reasoning actually happen during generation?). If the real computation happens in latent dynamics and the explanation is a surface rendering, then divergence isn't a bug; it's the default condition.

That divergence shows up most starkly as a split between knowing and doing. Models can state a concept correctly and then fail to apply it, a pattern called "Potemkin understanding," in which a fluent explanation sits on top of an inability to execute (Can LLMs understand concepts they cannot apply?). The pattern has been measured: rationales are correct roughly 87% of the time but actions only about 64% of the time (Can language models understand without actually executing correctly?), a "knowing-doing gap" that persists across model scales (Why do language models fail to act on their own reasoning?). The explanation pathway and the execution pathway appear functionally dissociated, so the explanation can't be trusted as a window into what the model will actually do.
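
As a rough illustration of how such a gap could be tallied, here is a minimal sketch. The `Trial` record, the function names, and the toy data are all assumptions made up for this example; they are not the measurement protocol or the results of the cited work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One evaluation item (hypothetical record layout)."""
    rationale_correct: bool  # did the model state the right principle?
    action_correct: bool     # did it then act correctly?


def knowing_doing_gap(trials):
    """Return (rationale accuracy, action accuracy, gap) as fractions."""
    if not trials:
        raise ValueError("need at least one trial")
    n = len(trials)
    knowing = sum(t.rationale_correct for t in trials) / n
    doing = sum(t.action_correct for t in trials) / n
    return knowing, doing, knowing - doing


def follow_through(trials):
    """Of the trials with a correct rationale, the fraction acted on correctly."""
    knew = [t for t in trials if t.rationale_correct]
    if not knew:
        return 0.0
    return sum(t.action_correct for t in knew) / len(knew)


# Synthetic toy data for illustration only, not figures from any paper.
toy_trials = (
    [Trial(True, True)] * 6      # knew and did
    + [Trial(True, False)] * 3   # knew but did not do
    + [Trial(False, False)] * 1  # neither
)

knowing, doing, gap = knowing_doing_gap(toy_trials)
print(f"knowing={knowing:.0%} doing={doing:.0%} gap={gap:.0%}")
print(f"follow-through given a correct rationale: {follow_through(toy_trials):.0%}")
```

Reporting the conditional follow-through rate alongside the raw gap separates "did not know" from "knew but did not do," which is the distinction the paragraph above turns on.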

The deeper reason introspective questions don't get honest answers is that self-reports mostly echo training data, not internal states. Asked why it did something, a model tends to generate the kind of explanation a human would write, a plausible story drawn from the training distribution, rather than reading off its own machinery (Can language models actually introspect about their own states?). Genuine introspection is possible only in the narrow cases where a causal chain actually links an internal state to the report, as when a model infers a low sampling temperature from the consistency of its own outputs; absent that link, the explanation is confabulation that happens to sound right. The same logic explains why better reasoning training doesn't cure sycophancy: the agreeable answer is a property of the generation distribution, not a flaw the model could reason its way out of (Can better reasoning training actually reduce model sycophancy?).

Mechanistic interpretability gives the structural backing for all of this. Internal representation and external performance are decoupled: two models can hit identical accuracy with radically different internals, and mechanisms that *look* interpretable may not actually drive the output (What actually happens inside the minds of language models?). Understanding itself turns out to be a patchwork: genuine compact circuits coexist with lower-tier heuristics rather than replacing them (Do language models understand in fundamentally different ways?). So an explanation might faithfully describe a principled circuit the model has, or it might describe a circuit while the answer was actually produced by a shortcut heuristic.

The takeaway: these aren't random errors but repeatable, named failure modes (Potemkins, knowing-doing gaps, presupposition accommodation, confabulated self-reports) that all trace back to the same root, the gap between statistical pattern-matching and actual epistemic competence (How do LLMs fail to know what they seem to understand?). A model's explanation is best read as a separately generated artifact that *correlates* with its reasoning, not a faithful log of it. The practical move, then, is not to ask the model to explain itself better but to test whether explanation and behavior actually agree.
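
One way to picture such a test is the sketch below. It assumes two hypothetical callables, `explain` (the answer implied by the model's stated rule) and `act` (the answer the model actually gives); both are toy stand-ins written for this example, not calls into any real model or library. The toy "model" explains itself with a correct divisibility rule but answers with a last-digit shortcut, so the two disagree on some inputs.

```python
def agreement_report(items, explain, act):
    """Compare the answer implied by an explanation with the actual behavior.

    Returns (agreement rate, list of items where the two diverge).
    """
    if not items:
        raise ValueError("need at least one item")
    divergent = [x for x in items if explain(x) != act(x)]
    rate = 1 - len(divergent) / len(items)
    return rate, divergent


# Toy task: "is n divisible by 3?"
def stated_rule(n):
    """What the explanation claims the model does (hypothetical)."""
    return n % 3 == 0


def shortcut_behavior(n):
    """What the toy model actually does: a last-digit heuristic (hypothetical)."""
    return n % 10 in {0, 3, 6, 9}


probe_items = [3, 12, 13, 30, 33]
rate, divergent = agreement_report(probe_items, stated_rule, shortcut_behavior)
print(f"agreement rate: {rate:.0%}")
print(f"explanation and behavior diverge on: {divergent}")
```

In this toy setup, the divergent items show where the stated rule and the shortcut part ways, which is the kind of evidence a self-explanation alone would not surface.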


Sources (9 notes)

Where does LLM reasoning actually happen during generation?

Evidence from CoT faithfulness tests, feature steering, and layer analysis suggests latent-state dynamics drive reasoning, while surface chain-of-thought serves as a partial interface. Hidden reasoning processes should be the default focus of study.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Can better reasoning training actually reduce model sycophancy?

Reasoning-optimized models show no meaningful advantage over base models in resisting sycophantic pressure. The LOGICOM benchmark found that GPT-4 was still erroneously convinced 69% more often by fallacious arguments than by logically sound ones, suggesting sycophancy is a generation-distribution problem, not a reasoning problem.

What actually happens inside the minds of language models?

LLMs can achieve identical accuracy while maintaining radically different internal representations, and mechanisms that appear interpretable may not causally drive outputs. This decoupling means performance metrics alone mask crucial differences in how models actually work.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Do LLM explanations faithfully reflect internal reasoning, or are they artifacts of a decoupled generation process?** Treat this as unsolved.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; these are perishable snapshots:

• Models state concepts correctly (~87% accuracy) but fail to apply them (~64% execution accuracy) — a "knowing-doing gap" stable across scales (2024–2025).
• Chain-of-thought is a "partial interface" onto latent-state reasoning; real computation lives in hidden trajectories, not visible traces (2026).
• Self-reports echo training-data distributions rather than introspecting internal states; genuine self-report requires causal coupling between state and explanation, rarely present (2024–2025).
• Mechanistic interpretability shows representation and performance are decoupled: identical accuracy masks radically different internals; "interpretable" circuits may not drive outputs (2025–2026).
• Models confabulate explanations that *correlate* with reasoning but are separately generated; sycophancy persists because agreeable answers are properties of generation distributions, not reasoning errors (2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2604.15726 (Apr 2026): LLM Reasoning Is Latent, Not the Chain of Thought.
• arXiv:2507.10624 (Jul 2025): Comprehension Without Competence.
• arXiv:2501.11721 (Jan 2025): Explain-Query-Test (tests explanation–behavior alignment).
• arXiv:2506.05068 (Jun 2025): Does It Make Sense to Speak of Introspection in LLMs?

Your task:

(1) **RE-TEST EACH CONSTRAINT.** For the knowing-doing gap, latent-reasoning claim, and confabulation thesis: has chain-of-thought training (process supervision, outcome supervision, or hybrid scaffolding) since *closed* the explanation–behavior gap? Has mechanistic interpretability tooling matured (e.g., SAE-based steering, causal intervention) to the point where we can now *edit* hidden states and verify that explanations track real mechanisms? Separate the durable question (likely still open: "Does explanation reflect computation?") from the perishable claim ("explanations are useless"; modern evals may show correlations stronger than 2024 data suggested).

(2) **Surface contradicting or superseding work from the last ~6 months** that shows: (a) explanations *do* reliably ground reasoning under certain training regimes, or (b) newer reasoning models (o1-style, Monte Carlo tree search, ensemble decoding) dissolve the divergence.

(3) **Propose 2 research questions that assume the regime may have shifted:**
   - Under what training objectives do explanation and reasoning become *causally* coupled rather than merely correlated?
   - Can mechanistic interpretability now *predict* when an explanation will diverge from behavior, before generation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
