SYNTHESIS NOTE
Psychology, Society, and Alignment

Can local language models rate therapy engagement reliably?

Explores whether using a local LLM to generate engagement ratings produces psychometrically sound measurements comparable to traditional human-rated scales, while preserving data privacy.

Synthesis note · 2026-04-18 · sourced from Psychology Therapy Practice

LLEAP (Large Language Model Engagement Assessment in Psychological Therapies) introduces a methodological shift: instead of asking an LLM to assess a construct directly, it uses LLM responses as items in a psychometric rating scale, mirroring traditional scale construction but replacing human raters with a local Llama 3.1 8B model. Applied to automatically transcribed videos of 1,131 sessions from 155 patients, the approach shows strong psychometric properties: reliability omega = 0.953, acceptable model fit (CFI = 0.968, SRMR = 0.022), and significant correlations with engagement determinants (motivation r = .413, alliance), processes (between-session effort r = .390), and outcomes (symptom reduction r = -.304).
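
To make the reliability figure concrete, here is a minimal sketch of how an omega coefficient can be computed from standardized factor loadings. It assumes the reported omega is McDonald's omega total for a single-factor model with uncorrelated residuals, which the note does not state. The function name `mcdonald_omega` and the example loadings are illustrative assumptions, not values from the LLEAP study.

```python
def mcdonald_omega(loadings):
    """Omega total for a one-factor model with standardized loadings.

    Assumption: residuals are uncorrelated, so each item's unique
    variance is 1 - loading**2.
    """
    if not loadings:
        raise ValueError("at least one loading is required")
    # Variance explained by the common factor in the sum score.
    common = sum(loadings) ** 2
    # Summed unique (error) variance across items.
    unique = sum(1.0 - l ** 2 for l in loadings)
    return common / (common + unique)


# Hypothetical loadings for an 8-item scale (not from the study).
example_loadings = [0.85] * 8
omega = mcdonald_omega(example_loadings)
print(f"omega = {omega:.3f}")
```

Under this formula, a short scale only reaches omega values above .95 when its items load strongly and uniformly on one factor.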

The methodological contribution is the bridge between NLP and classical psychometrics. Rather than treating LLM outputs as direct measurements (where validity is opaque), the approach subjects LLM-generated ratings to the same psychometric evaluation framework — item analysis, factor structure, reliability, convergent and discriminant validity — that would be applied to any new rating scale. The 120-item pool is reduced to the top 8 items for the final scale, following standard scale construction principles.
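
The note does not say which criterion was used to pick the top 8 items from the 120-item pool. As one common option from classical scale construction, the sketch below ranks items by corrected item-total correlation (each item correlated with the sum of all other items) and keeps the best `k`. The function names, the data layout, and the toy ratings are all assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_top_items(ratings, k):
    """Rank items by corrected item-total correlation and keep the top k.

    ratings: dict mapping item id -> list of ratings, one per session.
    Returns (selected item ids, dict of item id -> correlation).
    """
    items = list(ratings)
    n_sessions = len(ratings[items[0]])
    totals = [sum(ratings[i][s] for i in items) for s in range(n_sessions)]
    scores = {}
    for i in items:
        # Remove the item itself from the total to avoid inflating r.
        rest = [totals[s] - ratings[i][s] for s in range(n_sessions)]
        scores[i] = pearson(ratings[i], rest)
    ranked = sorted(items, key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores


# Toy data: three coherent items and one noisy item (hypothetical).
toy = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 3, 4, 5, 6],
    "c": [1, 3, 3, 5, 5],
    "noise": [3, 1, 4, 1, 5],
}
selected, item_scores = select_top_items(toy, k=3)
print(selected)
```

In a full scale-construction workflow this ranking would be only one input, alongside factor structure and validity evidence.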

Two practical advantages stand out. First, local implementation: running Llama 3.1 8B locally means that confidential therapy session data never leaves the institution, which addresses the privacy barrier that blocks clinical use of cloud-based LLMs. Second, interpretability: because the scale uses discrete, human-readable items rather than opaque embeddings, clinicians can see what is being measured. Building on the related note "Can we measure therapist-patient alliance from dialogue turns in real time?", LLEAP extends the automated measurement toolkit from alliance to engagement, and its psychometric validation framework provides a template that could be applied to any construct measurable from transcripts.
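
The rating step itself can be sketched without committing to any particular inference library. The example below assumes a hypothetical `generate` callable that wraps whatever locally hosted model is in use and returns its text response. The prompt wording, the 1 to 5 response range, and the function names are assumptions, since the note does not describe LLEAP's actual prompts or response format.

```python
import re

# Hypothetical prompt; the real LLEAP prompts are not given in this note.
PROMPT_TEMPLATE = (
    "Read the therapy session transcript and rate the statement "
    "on a scale from 1 (not at all) to 5 (very much). "
    "Answer with a single number.\n\n"
    "Statement: {item}\n\nTranscript:\n{transcript}\n\nRating:"
)


def parse_rating(text, low=1, high=5):
    """Extract the first integer from the model output, or None if invalid."""
    match = re.search(r"-?\d+", text)
    if match is None:
        return None
    value = int(match.group())
    return value if low <= value <= high else None


def rate_session(transcript, items, generate):
    """Rate one transcript on every item using a local model callable.

    generate: hypothetical function(prompt: str) -> str that runs on
    local hardware, so the transcript never leaves the machine.
    """
    ratings = {}
    for item in items:
        prompt = PROMPT_TEMPLATE.format(item=item, transcript=transcript)
        ratings[item] = parse_rating(generate(prompt))
    return ratings


# Stand-in for a local model, used only to show the data flow.
def fake_generate(prompt):
    return "Rating: 4"


print(rate_session("T: How was your week? P: ...", ["The patient is engaged."], fake_generate))
```

Keeping each item as a separate, readable statement is what lets clinicians inspect the scale content, and returning `None` for unparseable output keeps bad generations out of the item scores.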

The approach also addresses a key limitation of traditional measurement: response burden. Self-report instruments require patient participation and are prone to social desirability bias. Observer-based ratings require intensive training and time. Automated transcript analysis removes both burdens while, on the reported results, maintaining measurement rigor. In light of the related note "Do therapists accurately perceive the working alliance with patients?", automated measurement from transcripts, rather than from self-report, may capture engagement dynamics that neither therapists nor patients accurately report.

Original note title

LLM-generated rating scales for therapy transcripts achieve strong psychometric properties — enabling automated patient engagement measurement without human raters or cloud data exposure