How does automated transcript analysis compare to patient self-report on engagement?
This explores whether reading the conversation itself — having an LLM or algorithm score what was said — tells you something different from asking patients to rate their own engagement, and which signal you should trust.
Machine-read transcripts and patients' ratings of their own engagement do not seem to measure the same thing, and the corpus suggests that is exactly why comparing them is interesting. The strongest case for automated analysis comes from LLEAP, where a locally run Llama 3.1 8B model scored 1,131 therapy sessions and reached high reliability (omega = 0.953) while correlating with motivation, effort, and symptom outcomes [Can local language models rate therapy engagement reliably?]. A related system, COMPASS, goes finer still, inferring therapist-patient alliance turn by turn from dialogue rather than from an end-of-session questionnaire [Can we measure therapist-patient alliance from dialogue turns in real time?]. The pitch is that the transcript is continuous, objective, and doesn't depend on a patient pausing to introspect.
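For readers unfamiliar with the omega statistic, here is a minimal sketch of how such a coefficient can be computed. The source does not say how LLEAP derived its value; this assumes McDonald's omega total under a one-factor model with standardized loadings and uncorrelated errors, and the function name `omega_total` is hypothetical.

```python
def omega_total(loadings):
    """McDonald's omega total for a one-factor model.

    Assumes standardized loadings and uncorrelated errors, so each
    item's unique variance is 1 - loading**2. This is an illustrative
    assumption, not a description of LLEAP's actual procedure.
    """
    if not loadings:
        raise ValueError("need at least one loading")
    # Variance explained by the common factor: (sum of loadings)^2
    common = sum(loadings) ** 2
    # Variance left unexplained across items
    unique = sum(1.0 - l * l for l in loadings)
    return common / (common + unique)
```

Four items that each load at 0.8 give roughly 0.88 under these assumptions, so a value near 0.95 implies items that track the underlying engagement factor closely.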
The catch is that self-report has its own well-documented blind spots, and they cut in a direction that flatters automated analysis. Patients reliably report a genuine emotional bond with therapeutic chatbots, but that bond score floats free of whether the bot is clinically safe or whether it's quietly disrupting the patient's own emotional signaling; a single felt-connection number conflates dimensions that should stay separate [Do therapeutic chatbot bond scores hide deeper safety problems?]. So high self-reported engagement can coexist with a session that a transcript reading would flag as going wrong. COMPASS makes this concrete: for anxiety and depression, patient and therapist alliance signals converge over time, but for suicidality they stay persistently misaligned, a gap that a self-report bond score would paper over [Can we measure therapist-patient alliance from dialogue turns in real time?].
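One rough way to make "converge" versus "persistently misaligned" checkable is to compare the patient-therapist gap early and late in a session. The sketch below assumes you already have one alliance score per turn for each side. The function names and the first-half versus second-half comparison are illustrative choices, not COMPASS's published method.

```python
def mean_abs_gap(patient, therapist):
    """Mean absolute difference between two equal-length score series."""
    return sum(abs(p - t) for p, t in zip(patient, therapist)) / len(patient)


def gap_trend(patient, therapist):
    """Return (early_gap, late_gap) for per-turn alliance scores.

    Splits the session at its midpoint. A late gap well below the early
    gap suggests convergence; similar values suggest persistent
    misalignment. The split rule is an assumption for illustration.
    """
    if len(patient) != len(therapist) or len(patient) < 2:
        raise ValueError("need two equal-length series with at least 2 turns")
    mid = len(patient) // 2
    early = mean_abs_gap(patient[:mid], therapist[:mid])
    late = mean_abs_gap(patient[mid:], therapist[mid:])
    return early, late
```

A session where the late gap stays close to the early gap is the pattern the suicidality finding describes, and it is one that a single end-of-session rating has no way to show.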
What's genuinely surprising is that transcripts carry engagement signals neither party would think to self-report. Therapist first-person pronoun frequency negatively predicts alliance and measured patient trust, while patient disfluencies such as filler pauses actually mark relaxed, stronger rapport [Does therapist self-reference language predict weaker therapeutic alliance?]. Nobody fills out a survey saying "I trusted them less because they said 'I' too often." The transcript sees structure the questionnaire can't ask about, which is the real argument for automated analysis: not that it replaces self-report, but that it reads a different channel.
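Markers like these are cheap to extract. As a minimal sketch, the function below counts first-person singular pronouns and filler tokens per 100 words for each speaker. The word lists, the tokenizer, and the name `marker_rates` are assumptions for illustration; the source does not specify the lexicons the original study used.

```python
import re
from collections import defaultdict

# Assumed lexicons; the original study's exact word lists are not given.
FIRST_PERSON = {"i", "i'm", "i've", "i'd", "i'll", "me", "my", "mine", "myself"}
FILLERS = {"um", "uh", "uhm", "er", "hmm"}

TOKEN_RE = re.compile(r"[a-z']+")


def marker_rates(turns):
    """Per-speaker rates (per 100 words) of first-person pronouns and fillers.

    `turns` is a list of (speaker, text) pairs.
    """
    counts = defaultdict(lambda: {"words": 0, "first_person": 0, "fillers": 0})
    for speaker, text in turns:
        tokens = TOKEN_RE.findall(text.lower())
        c = counts[speaker]
        c["words"] += len(tokens)
        c["first_person"] += sum(t in FIRST_PERSON for t in tokens)
        c["fillers"] += sum(t in FILLERS for t in tokens)

    rates = {}
    for speaker, c in counts.items():
        n = c["words"] or 1  # avoid division by zero for empty speakers
        rates[speaker] = {
            "first_person_per_100": 100 * c["first_person"] / n,
            "fillers_per_100": 100 * c["fillers"] / n,
        }
    return rates
```

Rates rather than raw counts keep long and short sessions comparable. Whether a given rate is "high" would still have to be judged against a reference sample, which this sketch does not provide.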
The corpus also warns, though, that machines reading transcripts can project engagement that isn't there. Therapists reviewing GPT-4 found it "reads into" user feelings, injecting emotional interpretations the user never expressed [Do language models add feelings users never actually expressed?]. This is the mirror image of the self-report problem, and it rhymes with the finding that LLM self-reports mostly echo training-data distributions rather than any real internal state [Can language models actually introspect about their own states?]. A model scoring engagement can be projecting the same way a model reporting on itself does. So neither the patient's account nor the machine's reading is a clean ground truth; each fails in a characteristic way.
The deeper lesson the corpus keeps circling is that "engagement" as a metric is treacherous on its own terms: it can move in the opposite direction from quality, as when more informative AI summaries reduced click-through because users no longer needed to open the notification [Does better summary writing actually increase user engagement?]. That's why the most promising work doesn't just measure engagement but feeds it back as a live signal: R2D2 turns turn-level alliance scores into a reward that recommends what the therapist should do next [Can reinforcement learning optimize therapy dialogue in real time?]. The comparison, in the end, isn't transcript versus self-report as rival truths. Automated analysis gives you a continuous, structural, sometimes-projecting read, self-report gives you a felt-but-conflated one, and the interesting systems triangulate between them rather than picking a winner.
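Triangulating can start very simply: put both signals on a common scale and look at the sessions where they disagree. The sketch below standardizes each series and flags sessions whose z-scores differ by more than a threshold. The threshold of 1.5 and the function names are arbitrary illustrative choices, not values taken from any of the cited systems.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize a series; returns zeros if it has no variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def flag_divergent(self_report, transcript, threshold=1.5):
    """Indices of sessions where standardized signals disagree.

    `self_report` and `transcript` hold one score per session, on any
    scale. The default threshold is an assumption for illustration.
    """
    if len(self_report) != len(transcript):
        raise ValueError("series must have one score per session each")
    zs, zt = zscores(self_report), zscores(transcript)
    return [i for i, (a, b) in enumerate(zip(zs, zt)) if abs(a - b) > threshold]
```

A flagged session is not evidence that either signal is wrong. It marks where a human reviewer should look, since the patient may be reporting a felt bond the transcript contradicts, or the model may be reading in something that was never said.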
Sources (8 notes)
LLEAP achieved high reliability (omega = 0.953) and valid correlations with motivation, effort, and symptom outcomes using Llama 3.1 8B to rate 1,131 therapy sessions, while keeping data locally stored.
COMPASS maps dialogue turns onto WAI embeddings to produce 36-dimensional alliance scores per turn. Anxiety and depression show convergence in alliance metrics over time, while suicidality shows persistent misalignment between patient and therapist.
Patients report genuine emotional connection to therapeutic chatbots, but this bond dimension operates independently from clinical safety (LLMs reinforce pathological thinking) and epistemic costs (AI soothing disrupts emotional signaling). Single metrics conflate these separate dimensions.
High frequency of therapist 'I' usage correlates with lower patient-reported alliance and reduced trusting behavior in validated behavioral tasks. Patient non-fluency markers like filler pauses, conversely, signal relaxed communication and stronger alliance.
Therapists reviewing GPT-4 in the CaiTI system found it "reads into" user feelings rather than responding objectively. Task decomposition across specialized models (Reasoner/Guide/Validator) reduces but does not eliminate this interpretation bias.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
Nextdoor experiments showed LLM-generated summaries were objectively more informative but decreased click-through rates. Users had no reason to open notifications when the summary already satisfied their information need, demonstrating how optimizing for informativeness can backfire on engagement metrics.
R2D2 demonstrates that RL agents trained on multi-objective working alliance scores can generate disorder-specific policies that recommend treatment strategies in real time. The system operates as an AI supervisor, transcribing sessions and recommending next topics based on task, bond, and goal alignment.