Can local language models rate therapy engagement reliably?

Explores whether using a local LLM to generate engagement ratings produces psychometrically sound measurements comparable to traditional human-rated scales, while preserving data privacy.

Synthesis note · 2026-04-18 · sourced from Psychology Therapy Practice

LLEAP (Large Language Model Engagement Assessment in Psychological Therapies) introduces a methodological shift: instead of using LLMs to directly assess a construct, it uses LLM responses as items in a psychometric rating scale — mirroring traditional scale construction but replacing human raters with a local Llama 3.1 8B model. Applied to automatically transcribed videos of 1,131 sessions from 155 patients, the approach shows strong psychometric properties: reliability omega = 0.953, acceptable model fit (CFI = 0.968, SRMR = 0.022), and significant correlations with engagement determinants (motivation r = .413, alliance), processes (between-session effort r = .390), and outcomes (symptom reduction r = -.304).
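As a rough illustration of the reliability figure quoted above, omega for a one-factor scale can be computed from standardized factor loadings. This is a minimal sketch, assuming a unidimensional model with uncorrelated residuals; the function name and the example loadings are hypothetical and are not taken from the LLEAP study.

```python
from typing import Sequence


def omega_total(loadings: Sequence[float]) -> float:
    """Omega for a one-factor model, given standardized loadings.

    Assumes uncorrelated residuals, so each item's unique variance
    is 1 - loading**2.
    """
    common = sum(loadings) ** 2                      # variance due to the factor
    unique = sum(1.0 - l * l for l in loadings)      # summed residual variances
    return common / (common + unique)


# Hypothetical loadings for an 8-item scale (NOT values from the study).
example_loadings = [0.85] * 8
print(round(omega_total(example_loadings), 3))
```

The point of the sketch is that omega rises with both the size of the loadings and the number of items, which is why a short scale needs uniformly strong items to reach values above 0.9.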

The methodological contribution is the bridge between NLP and classical psychometrics. Rather than treating LLM outputs as direct measurements (where validity is opaque), the approach subjects LLM-generated ratings to the same psychometric evaluation framework — item analysis, factor structure, reliability, convergent and discriminant validity — that would be applied to any new rating scale. The 120-item pool is reduced to the top 8 items for the final scale, following standard scale construction principles.
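The note does not say which criterion was used to reduce the pool to 8 items. One common choice in scale construction is the corrected item-total correlation, sketched below under that assumption. The function names and the toy rating matrix are hypothetical, and the sketch ignores the nesting of sessions within patients.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def select_top_items(ratings: List[List[float]], k: int) -> List[int]:
    """ratings[s][i] is the LLM rating of session s on item i.

    Returns the indices of the k items with the highest corrected
    item-total correlation (item vs. sum of all other items).
    """
    totals = [sum(row) for row in ratings]
    scored = []
    for i in range(len(ratings[0])):
        item = [row[i] for row in ratings]
        rest = [t - v for t, v in zip(totals, item)]  # total without item i
        scored.append((pearson(item, rest), i))
    scored.sort(reverse=True)
    return sorted(i for _, i in scored[:k])


# Toy data: items 0-2 move together, item 3 does not.
toy = [[1, 1, 1, 3], [2, 2, 2, 1], [3, 3, 3, 2], [4, 4, 4, 4], [5, 5, 5, 1]]
print(select_top_items(toy, 3))
```

In practice, item selection would usually also weigh factor loadings and content coverage, so this criterion should be read as one ingredient rather than the whole procedure.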

Two practical advantages stand out. First, local implementation: running Llama 3.1 8B locally keeps confidential therapy session data inside the institution, which addresses the privacy barrier that blocks clinical use of cloud-based LLMs. Second, interpretability: because the scale uses discrete, human-readable items rather than opaque embeddings, clinicians can inspect what is being measured. Alongside the turn-level alliance measurement described in "Can we measure therapist-patient alliance from dialogue turns in real time?", LLEAP extends the automated measurement toolkit from alliance to engagement, and its psychometric validation framework provides a template that could be applied to any construct measurable from transcripts.

The approach also addresses a key limitation of traditional measurement: response burden. Self-report instruments require patient participation and are prone to social desirability bias. Observer-based ratings require intensive training and time. Automated transcript analysis removes both burdens while maintaining measurement rigor. Given the therapist-report biases discussed in "Do therapists accurately perceive the working alliance with patients?", automated measurement from transcripts, rather than from self-report, may capture engagement dynamics that neither therapists nor patients accurately report.

Inquiring lines that read this note 29

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How can we distinguish genuine user preferences from measurement artifacts?

How does unidimensionality in assessments affect measurement validity?

How can real-time alliance measurement improve therapy outcomes?

Why do LLM chatbots fail as independent therapeutic agents?

How do evaluation biases undermine LLM quality assessment systems?

How can LLM recommenders match or exceed collaborative filtering performance?

Can topic embeddings make RL dialogue recommendations interpretable to clinicians?

Why should disagreement be treated as signal in collaborative reasoning?

Can decreased engagement be distinguished from genuine semantic contradiction?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

Can synchrony metrics automatically evaluate the quality of therapeutic AI conversations?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

Do LLMs show stigma or reinforce delusions in mental health contexts?

How should dialogue systems best leverage conversation history for retrieval?

Should memorability systems rely on individual reports instead of group-level signals?

Why do benchmark improvements fail to reflect actual reasoning quality?

What privacy-preserving evaluation methods best capture real-world forecasting ability?

Related concepts in this collection 4

Can we measure therapist-patient alliance from dialogue turns in real time? Explores whether computational methods can detect working alliance quality at turn-level resolution during therapy sessions, enabling immediate feedback on whether the therapeutic relationship is strengthening.
COMPASS measures alliance; LLEAP measures engagement; both from transcripts; LLEAP adds psychometric validation
Do therapists accurately perceive the working alliance with patients? This research explores whether therapists' own assessments of the therapeutic relationship match what patients actually experience, especially in high-risk cases like suicidality.
automated measurement bypasses the self-report and therapist-report biases that distort alliance data
Can AI generate assessment questions as good as human experts? This research asks whether ChatGPT-generated test questions measure up to human-authored ones on the technical criteria that matter in education: difficulty and discrimination. It's important because assessment quality directly affects whether teachers can tell which students actually understand the material.
LLMs generating assessment items vs LLMs as raters in a psychometric framework; complementary approaches to LLM-based measurement
Can reinforcement learning optimize therapy dialogue in real time? Can RL systems trained on working alliance scores recommend therapy topics that improve clinical outcomes during live sessions? This explores whether validated clinical constructs can serve as reward signals for dialogue optimization.
engagement measurement could serve as additional signal for AI supervisor systems

Original note title

LLM-generated rating scales for therapy transcripts achieve strong psychometric properties — enabling automated patient engagement measurement without human raters or cloud data exposure
