INQUIRING LINE

What cognitive capacities do LLMs actually lack that commentary assumes they have?

This line explores the cognitive capacities that public discussion takes for granted in LLMs (genuine understanding, knowing, reasoning, and reading minds) but that the research corpus shows are systematically absent or dissociated from the surface fluency people mistake for them.


This question reads as: when people talk about LLMs as if they 'understand' or 'know' things, which of those assumed capacities does the research actually find missing? The corpus points to a recurring pattern: models are fluent at the *output* of a capacity while lacking the *machinery* commentary assumes produces it. The sharpest version is what one line calls a computational split-brain: models state a correct principle with 87% accuracy but act on it correctly only 64% of the time, so the gap isn't missing knowledge but a structural disconnect between explanation and execution (Can language models understand without actually executing correctly?). A related line names this 'Potemkin understanding': a model explains a concept correctly, fails to apply it, *and* can recognize its own failure, a triple pattern that no coherent human understanding would produce (Can LLMs understand concepts they cannot apply?).

The capacity most assumed and most absent is genuine knowing. Models track statistical regularities at high fidelity but show structurally specific failures (hallucination, reasoning collapse, sensitivity to how a premise is phrased) that mark the measurable gap between tracking patterns and actually knowing something (What do language models actually know?). Pragmatic competence is another assumed capacity that turns out to be hollow: LLMs pattern-match on what's said but can't reliably reason about what's left unsaid, such as implicature, presupposition, and speaker intent, scoring 32% on ambiguity recognition where humans hit 90% (Why do LLMs fail at understanding what remains unsaid?). And theory of mind, perhaps the most anthropomorphized capacity of all, splits the same way: GPT-4.5 hits the 100th percentile predicting social norms, yet models such as o1 and Claude 3.7 regress on theory-of-mind tasks like Decrypto, with surface strategies collapsing the moment scenarios go open-ended (Why do LLMs excel at social norms yet fail at theory of mind?).

What makes this more than a list of deficits is that persuasion, the most socially consequential capacity, is *dissociable* from comprehension. Models sway debate participants and audiences yet cannot reliably evaluate the very arguments they deployed, so influence and understanding are separable, and fluency in one says nothing about the other (Can LLMs persuade without actually understanding arguments?). This is the through-line: commentary assumes these capacities travel together because in humans they do. In LLMs they come apart.

The corpus also pushes back on lazy versions of this critique, which is where it gets interesting. 'Real reasoning vs. pattern matching' turns out to be a bad axis: humans and LLMs fail and succeed along the *same* content-sensitivity curve on classic reasoning tests such as the Wason task, so content-independence isn't a meaningful line between machine and mind (Do language models fail reasoning tests that humans pass?). And capability isn't uniformly worse: LLMs actually outperform humans at multi-hop reasoning that integrates information across multiple sentences while losing to them on simple deduction, so the deficit is about the *kind* of capability, not its raw level (Why do LLMs fail at simple deductive reasoning?).

The deepest answer reframes the whole question. One line argues the missing ingredient isn't a skill at all but *participatory subjectivity*: LLMs are shaped by the same shared symbolic system as humans, but only humans develop reflexive agency through being socialized into it. That absence shows up concretely: AI argues without ever declaring its own position or reflecting on its assumptions (Do LLMs develop the same kind of mind as humans?). If you want a method for telling assumed capacity from actual capacity rather than just cataloguing failures, the corpus offers Marr's three levels of analysis as a structured way to ask what a model is computing, by what algorithm, and in what substrate (Can cognitive science methods unlock how LLMs actually work?). There is also a hopeful counterpoint: some 'missing' reasoning is actually latent and merely un-elicited, recoverable by structuring the model's own calls rather than retraining it (Can modular cognitive tools unlock reasoning without training?).
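To make "structuring the model's own calls" concrete, here is a minimal sketch of what isolated, modular tool calls might look like. It is not the method from the cited work: the tool names, prompt wording, the `LLM` callable type, and the `solve` pipeline are all assumptions chosen for illustration. The only idea it tries to show is operation isolation, where each tool sees nothing except the text explicitly handed to it.

```python
from typing import Callable, Dict

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Illustrative tool prompts. Names and wording are assumptions, not the paper's.
TOOL_PROMPTS: Dict[str, str] = {
    "restate": "Restate the problem and list the known quantities:\n{input}",
    "recall": "Recall one similar solved problem and its method:\n{input}",
    "check": "Check this candidate answer for errors:\n{input}",
}


def call_tool(llm: LLM, tool: str, payload: str) -> str:
    """Run one tool as an isolated call: it sees only its own prompt and payload."""
    return llm(TOOL_PROMPTS[tool].format(input=payload))


def solve(llm: LLM, question: str) -> Dict[str, str]:
    """Chain the tools; each step receives only the text passed to it explicitly."""
    restated = call_tool(llm, "restate", question)
    analogy = call_tool(llm, "recall", restated)
    draft = llm(f"Solve using these notes.\n{restated}\n{analogy}")
    review = call_tool(llm, "check", draft)
    return {"restated": restated, "analogy": analogy, "draft": draft, "review": review}
```

Passing the model in as a plain callable keeps the sketch independent of any vendor API: a real client wrapper or a test stub can be supplied in the same way, and no retraining is involved at any step.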


Sources (11 notes)

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

What do language models actually know?

LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.

Why do LLMs fail at understanding what remains unsaid?

Research shows LLMs pattern-match on explicit language but cannot reason about implicatures, presuppositions, or speaker intentions. They fail at scalar implicature adaptation, ambiguity recognition (32% vs 90% human accuracy), and implicit warrant validation in arguments—core features of pragmatic competence.

Why do LLMs excel at social norms yet fail at theory of mind?

GPT-4.5 reaches the 100th percentile on social norm prediction, yet o1 and Claude 3.7 regress on theory of mind tasks like Decrypto. Open-ended scenarios expose surface-level strategies hidden by structured questions, and reasoning effort does not improve social reasoning performance.

Can LLMs persuade without actually understanding arguments?

The Thin Line study shows LLMs sway debate participants and audiences but cannot reliably evaluate those same debates, with inter-annotator agreement ranging from near-zero to 0.6. Persuasive competence and pragmatic comprehension are separable capabilities.

Do language models fail reasoning tests that humans pass?

Research shows both humans and LLMs succeed and fail along the same content-sensitivity axis in reasoning tasks like Wason tests and natural language inference. Content-independence is not a meaningful criterion for distinguishing real reasoning from pattern matching.

Why do LLMs fail at simple deductive reasoning?

The Minds vs. Machines benchmark shows LLMs excel at integrating information across multiple sentences while humans outperform them on straightforward logical inference. Capability type, not complexity level, determines who performs better.

Do LLMs develop the same kind of mind as humans?

Both humans and LLMs are shaped by the same intersubjective symbolic system, but only humans develop reflexive agency through socialization. This absence produces measurable differences in how AI argues without declaring its position or reflecting on its own assumptions.

Can cognitive science methods unlock how LLMs actually work?

Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
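The headline figures quoted in the notes above can be restated as percentage-point gaps, which is a small arithmetic check rather than a new result. The sketch below assumes the figures exactly as reported here; the dictionary keys are made-up labels, and the 30-problem size of AIME 2024 is an assumption used only to convert the reported percentages into approximate problem counts. The three gaps come from different benchmarks, so they are not directly comparable to one another.

```python
# Reported figures copied from the notes above, as percentages (high, low).
# The key names are illustrative labels, not terms from the source papers.
REPORTED = {
    "explain_vs_execute": (87.0, 64.0),
    "human_vs_llm_ambiguity": (90.0, 32.0),
    "aime_with_vs_without_tools": (43.3, 26.7),
}


def gap_points(high: float, low: float) -> float:
    """Difference between two percentages, in percentage points."""
    return round(high - low, 1)


gaps = {name: gap_points(hi, lo) for name, (hi, lo) in REPORTED.items()}

# Assumption: AIME 2024 has 30 problems in total, so the reported
# percentages correspond to whole-problem counts after rounding.
AIME_PROBLEMS = 30
solved_before = round(26.7 / 100 * AIME_PROBLEMS)
solved_after = round(43.3 / 100 * AIME_PROBLEMS)

for name, points in gaps.items():
    print(f"{name}: {points} points")
print(f"AIME problems solved: {solved_before} -> {solved_after}")
```

Under that 30-problem assumption, the cognitive-tools improvement amounts to a handful of additional problems, which is worth keeping in mind when treating these numbers as perishable benchmarks.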

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM capability analyst re-testing claims about what LLMs lack. The question: which cognitive capacities do LLMs demonstrably *not* have, despite commentary assuming they do?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat as perishable benchmarks:
• Comprehension-without-competence: models state correct principles (87% accuracy) but fail to act on them (64%), a structural split between explanation and execution (~2025, arXiv:2507.10624).
• Pragmatic incompetence: LLMs score 32% on ambiguity/implicature recognition where humans hit 90%, unable to reason about unsaid premises (~2022–2025).
• Theory of mind dissociates: GPT models hit 100th percentile on social norms but regress on genuine reasoning about other minds, with surface strategies failing on open-ended scenarios (~2025, arXiv:2502.08796).
• Persuasion is dissociable from comprehension: models sway audiences while unable to reliably evaluate arguments they deployed (~2025, arXiv:2507.01936).
• Participatory subjectivity absent: LLMs lack reflexive agency; they argue without declaring position or reflecting on assumptions (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 (2025-07) — Comprehension Without Competence
• arXiv:2502.08796 (2025-02) — Systematic Review on Theory of Mind Tasks
• arXiv:2507.01936 (2025-07) — Comprehension and Persuasion Dissociation
• arXiv:2506.12115 (2025-06) — Eliciting Reasoning with Cognitive Tools

Your task:
(1) RE-TEST EACH CONSTRAINT. For comprehension-without-competence, pragmatic reasoning, and theory of mind: judge whether newer models (Claude 4, o1-variants, or equivalents), improved chain-of-thought methods, or agent orchestration (multi-step tool use, memory contextualization) have since narrowed the 87%→64% gap or 32%→90% gap. Separate the durable question (does explanation still decouple from execution?) from the perishable metric (is the gap-size still accurate?). Plainly flag where a constraint still holds.
(2) Surface the strongest contradicting or superseding work from the last ~6 months claiming LLMs *do* possess these capacities or that the gaps are artifacts of evaluation rather than architecture.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., is the gap collapsible via prompting/orchestration without retraining? Does participatory subjectivity emerge in multi-agent or long-horizon conversational settings?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
