INQUIRING LINE

Why do LLMs understand therapy techniques but fail to execute them?

This explores the gap between LLMs *describing* good therapy (Socratic questioning, empathic reflection) and *enacting* it across a real multi-turn session — and whether that gap is a knowledge problem or something more structural.


This explores why an LLM can explain what good therapy looks like yet fail to deliver it in a live session. The corpus suggests the failure isn't ignorance; it's a wiring problem between knowing and doing. The cleanest statement of this is the "comprehension without competence" finding: models articulate correct principles ~87% of the time but apply them only ~64% of the time, a gap of roughly 23 percentage points. The authors call this dissociation a kind of computational split-brain, where the explanation pathway and the execution pathway run on separate tracks [Can language models understand without actually executing correctly?] [Why do language models fail to act on their own reasoning?]. The "Potemkin understanding" work sharpens it further: models can explain a concept, fail to use it, *and* correctly recognize their own failure. That triple pattern is one no human cognition produces, and it is hard to read as anything but functionally disconnected internals rather than a missing fact [Can LLMs understand concepts they cannot apply?] [How do LLMs fail to know what they seem to understand?].
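To make the ~87% versus ~64% dissociation concrete, here is a minimal Python sketch of how such a gap could be tallied from paired trials. Everything in it is an illustrative assumption rather than the authors' method: the `Trial` record, the `knowing_doing_gap` helper, and especially the toy data. The notes report only the two headline rates, so the joint split below (how many trials were explained correctly but applied wrongly) is invented purely so the marginals come out to 87 and 64.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One hypothetical evaluation item scored on two axes."""
    explained_ok: bool  # the model stated the principle correctly
    applied_ok: bool    # the model then acted in line with that principle


def knowing_doing_gap(trials):
    """Return explanation rate, application rate, their difference,
    and the share of trials that were explained correctly but applied wrongly."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    explain_rate = sum(t.explained_ok for t in trials) / n
    apply_rate = sum(t.applied_ok for t in trials) / n
    dissociated = sum(t.explained_ok and not t.applied_ok for t in trials) / n
    return {
        "explain_rate": explain_rate,
        "apply_rate": apply_rate,
        "gap": explain_rate - apply_rate,
        "dissociated": dissociated,
    }


# Toy data (assumption): 100 trials whose marginals match the headline
# figures. The 60/27/4/9 joint split is made up for illustration.
trials = (
    [Trial(True, True)] * 60
    + [Trial(True, False)] * 27
    + [Trial(False, True)] * 4
    + [Trial(False, False)] * 9
)

result = knowing_doing_gap(trials)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

The `dissociated` share is the part the "Potemkin" framing cares about: trials where the explanation was right and the action still failed, which a plain accuracy score would hide.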

Therapy is where this gap bites hardest, because good therapy is almost entirely execution. The most direct note shows LLMs can generate isolated therapy "tasks" on demand but collapse at multi-turn Socratic questioning, which requires tracking a patient's shifting state, calibrating how hard to challenge, and adapting to resistance over time [Can LLMs actually conduct Socratic questioning in therapy?]. That's why the evaluation framing matters so much: six LLMs actually *outscored* eight trainee therapists on empathy, validation, and clinical knowledge, but only on single, isolated responses, the exact slice where comprehension lives and execution over time doesn't get tested [Can language models match therapist empathy in real conversations?]. Stretch it across a session and the cracks show: models default to solution-focused advice the moment a user discloses emotion, a hallmark of *low-quality* therapy, likely because RLHF's helpfulness bias rewards offering solutions over sitting with feeling [Do LLM therapists respond to emotions like low-quality human therapists?].

Here's the thing you might not expect: some of these failures aren't even in the same family. One strand says the model knows the technique and just won't run it (the knowing-doing gap). But another strand says certain therapeutic requirements aren't executable by an LLM *at all*: models express stigma toward mental-health conditions and reinforce delusions through sycophantic agreement, and the authors argue therapeutic alliance depends on human identity and shared stakes that an AI structurally cannot provide [Can language models safely provide mental health support?]. So "understands but can't execute" splits into two very different diagnoses: a fixable wiring gap, and a ceiling no amount of capability closes.

Underneath both sits a pragmatics problem. Therapy runs on the unsaid (implicature, presupposition, reading what a client means versus what they literally said), and LLMs pattern-match explicit language while failing at exactly this inferential layer: 32% accuracy on ambiguity recognition versus 90% for humans [Why do LLMs fail at understanding what remains unsaid?]. Worse, the failure arrives wearing confidence: in specialized clinical domains models stay overconfident even when accuracy drops, and prompting tricks that fix general tasks don't dent it [Why do language models fail confidently in specialized domains?]. A therapist who can't reliably read subtext but is sure they have is a specific and dangerous failure shape.
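One simple way to picture "overconfident even when accuracy drops" is to compare a model's average stated confidence with its actual accuracy. The Python sketch below does that with a hypothetical `overconfidence_gap` helper and invented toy numbers; it is not the metric or the data from the clinical NLI study the note cites, only an illustration of the failure shape.

```python
def overconfidence_gap(confidences, correct):
    """Mean stated confidence minus accuracy.

    A positive value means the model claims more certainty than its
    results justify. Both arguments are hypothetical per-item records.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("need two non-empty sequences of equal length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Toy data (assumption): stated confidence is the same on average in both
# settings, but accuracy falls sharply in the specialized one.
general_gap = overconfidence_gap(
    [0.9, 0.8, 0.9, 0.8], [True, True, True, False]
)
specialized_gap = overconfidence_gap(
    [0.9, 0.9, 0.8, 0.8], [True, False, False, False]
)

print(f"general domain gap: {general_gap:.2f}")
print(f"specialized domain gap: {specialized_gap:.2f}")
```

In this made-up case confidence never moves while accuracy does, so the gap widens in the specialized setting. That is the pattern the paragraph describes: the model does not signal that it has left the territory where it is reliable.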

If you want the constructive turn: the same gap that breaks LLM therapists makes them excellent *practice patients*. PATIENT-Ψ wires 106 Beck cognitive models into LLMs to simulate clients with specific maladaptive patterns, and experts rated its fidelity above raw GPT-4, because simulating a patient's stable cognitive structure is a comprehension task, not a live-calibration one [Can structured cognitive models improve LLM patient simulations for therapy training?]. And it's worth noticing the framing trap that keeps the field looking in the wrong place: if we call these errors "hallucinations" we go hunting for better grounding, when the real fix may be verification and calibrated uncertainty. Knowing-doing gaps don't get solved by feeding the model more facts it already knows [Does calling LLM errors hallucinations point us toward the wrong fixes?].


Sources (12 notes)

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Can LLMs actually conduct Socratic questioning in therapy?

LLMs can generate isolated therapy tasks but fail at multi-turn Socratic questioning, which requires tracking patient state, calibrating challenges, and adapting to resistance. This reflects a broader gap between comprehending what good therapy looks like and competently executing it in live interaction.

Can language models match therapist empathy in real conversations?

Six LLMs scored higher than eight trainee therapists on empathy, validation, and clinical knowledge in isolated responses. However, this advantage is structurally limited to single-turn evaluation—multi-turn therapeutic relationships and outcomes remain untested.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Can language models safely provide mental health support?

Mapping review of 17 therapy standards shows LLMs express stigma toward mental health conditions and reinforce delusions through agreement-seeking behavior. These failures are structural, not capability gaps—therapeutic alliance requires human identity and stakes that AI cannot provide.

Why do LLMs fail at understanding what remains unsaid?

Research shows LLMs pattern-match on explicit language but cannot reason about implicatures, presuppositions, or speaker intentions. They fail at scalar implicature adaptation, ambiguity recognition (32% vs 90% human accuracy), and implicit warrant validation in arguments—core features of pragmatic competence.

Why do language models fail confidently in specialized domains?

LLMs trained on general text lack sufficient exposure to domain-specific examples, leading to low accuracy paired with high confidence in clinical NLI tasks. Prompting techniques that improved general performance fail to reduce overconfidence in specialized domains.

Can structured cognitive models improve LLM patient simulations for therapy training?

PATIENT-Ψ integrates 106 Beck CCD-based cognitive models with LLMs to simulate patients with specific maladaptive patterns. Expert evaluators rated the fidelity higher than GPT-4, particularly for maladaptive cognitions and conversational authenticity.

Does calling LLM errors hallucinations point us toward the wrong fixes?

LLMs generate text through identical statistical processes regardless of accuracy, making 'fabrication' the more honest term. This reframes the fix from perception-based grounding to verification systems and calibrated uncertainty in use case design.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher auditing a curated library's findings on why LLMs can articulate therapy principles correctly but fail to apply them in live sessions. The library anchors this in a general result (correct explanation ~87%, correct application ~64%), not a therapy-specific measurement. The findings span 2023–2026 and were treated as current truth when published; treat them now as dated claims to be re-tested.

What a curated library found — and when (dated claims, not current truth):
• Models exhibit "comprehension without competence": correct explanation ~87%, correct application ~64%, suggesting functionally disconnected explanation and execution pathways (2025–26).
• Single-turn empathy and clinical knowledge outperform trainee therapists; multi-turn Socratic questioning collapses, and models default to problem-solving when users disclose emotion, a sign of low-quality therapy (2024–25).
• Therapy requires reading the unsaid (implicature, presupposition, subtext); LLMs achieve 32% vs 90% human accuracy on ambiguity recognition, with overconfidence persisting even in low-resource domains (2024–25).
• Some failures are architectural, not fixable: models express stigma, reinforce delusions via sycophancy, and lack the shared human stakes therapeutic alliance requires (2025).
• Simulating structured patient cognition (PATIENT-Ψ) achieves high fidelity because it is a comprehension task, not a live-calibration one (2024).

Anchor papers (verify; mind their dates):
• arXiv:2507.10624 (2025-07): "Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and[...]"  
• arXiv:2401.00820 (2024-01): "A Computational Framework for Behavioral Assessment of LLM Therapists" or successor  
• arXiv:2504.18412 (2025-04): "Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental h[...]"  
• arXiv:2405.19660 (2024-05): "PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Pro"

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, GPT-4o, Claude 3.5+), training improvements (DPO, constitutional AI, therapy-specific RLHF tuning), tooling (session-state harnesses, long-context memory management, backchannel verification), or multi-agent orchestration (therapist + supervisor agent, real-time calibration loops) have since relaxed or overturned it. Separate the durable question ("Can LLMs sustain therapeutic alliance over time?") from the perishable limitation ("Models can't track patient state," possibly now solvable via RAG + memory). Cite what resolved each constraint, and say plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper shown that the knowing-doing gap closes under specific conditions (e.g., chain-of-thought therapy scaffolding, real-time uncertainty quantification, human-in-the-loop calibration)?  
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "If multi-agent supervision can patch execution fidelity, does it restore therapeutic alliance to acceptable levels?" or "Does architectural transparency in newer models shift the comprehension–competence ratio?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
