INQUIRING LINE

If most people drop out of chatbot therapy, can we trust the studies that say it works?

How do dropout rates and low adherence affect chatbot therapy outcomes?

This explores why people stop using therapy chatbots and whether the trials measuring chatbot success account for that drop-off. It is read here as a question about engagement decay and the gap between reported outcomes and real-world adherence. The corpus doesn't have a paper that puts a number on dropout rates directly, but read laterally, it explains the mechanisms that *drive* disengagement and, more pointedly, why the outcome studies you'd consult tend to hide the problem.

The clearest mechanism is novelty decay. Longitudinal work with the Mitsuku chatbot found that the social processes that make early interactions feel rewarding decline predictably as the novelty wears off, which means findings from single-session studies can't be stretched to predict medium- or long-term use (Do chatbot relationships lose their appeal as novelty wears off?). Personalization compounds this: as a chatbot adapts to you, each interaction raises your baseline expectations, so eventual failures are more disappointing than they would have been early on (Does chatbot personalization build trust or expose privacy risks?). Put together, these describe a curve where the thing that hooks users at session one is structurally temporary: a built-in adherence problem, not a deployment accident.

Here's the part most readers won't expect: the trials that report strong chatbot therapy outcomes are often designed in a way that papers over this. Comparing a chatbot to a waitlist or to psychoeducation measures "conversational contact" rather than any therapy-specific mechanism, which is how a 1960s script like ELIZA can match a modern chatbot on symptom reduction (Do chatbot trials against waitlists measure real therapeutic value?; Is conversational presence more therapeutic than clinical technique?). If the measured benefit is largely judgment-free presence, then what keeps someone showing up is a feeling rather than a clinical technique that could survive the novelty fade.

The medium itself turns out to matter more than the language model. A 15-day study found that physical robots and even paper worksheets significantly reduced distress while a chatbot running the *same* LLM did not; the active ingredient was social presence and structured format, the very things that sustain engagement (Why do robots outperform chatbots in therapy despite identical language models?). And what engagement does happen can be misleading: patients report genuine emotional bonds with chatbots, but those bond scores run independently of clinical safety, so a user can feel connected while the bot quietly reinforces pathological thinking (Do therapeutic chatbot bond scores hide deeper safety problems?). High reported satisfaction is not the same as a good outcome.

There's also a quieter reason users may drift away mid-process. LLMs tend to default to problem-solving when someone shares an emotion, a hallmark of low-quality therapy and a tendency likely driven by RLHF's helpfulness bias (Do LLM therapists respond to emotions like low-quality human therapists?; Does RLHF training push therapy chatbots toward problem-solving?), and they fail to recognize ambivalence or early-stage motivational states, missing exactly the users most at risk of quitting (Why can't chatbots detect when users are ambivalent about change?). So the takeaway you didn't know you wanted: the field's adherence problem isn't only that users get bored. The chatbots are weakest precisely at the moments (ambivalence, emotional disclosure) where retaining a wavering user is hardest, and the trial designs that should catch this are structured not to.

Sources 9 notes

Do chatbot relationships lose their appeal as novelty wears off?

Longitudinal studies with Mitsuku show that social processes driving relationship formation decline as novelty wears off. Single-session study findings cannot be reliably extrapolated to medium- or long-term chatbot design.

Does chatbot personalization build trust or expose privacy risks?

Longitudinal research shows personalization enhances trust and anthropomorphism but also amplifies privacy concerns and escalating user expectations. One-shot studies miss these temporal dynamics—each interaction raises the baseline, making failures more disappointing.

Do chatbot trials against waitlists measure real therapeutic value?

Comparing therapeutic chatbots to waitlist or psychoeducation controls creates false efficacy claims by measuring conversational contact rather than therapy-specific mechanisms. ELIZA matching Woebot performance demonstrates this; real evidence requires comparative trials against existing treatments and mechanism identification.

Is conversational presence more therapeutic than clinical technique?

ELIZA matches modern chatbots on symptom reduction, RLHF training degrades emotional attunement, and embodied robots outperform text-based ones with identical language models. The active ingredient is judgment-free listening, not therapeutic framework.

Why do robots outperform chatbots in therapy despite identical language models?

A 15-day study with 38 students found that robots and worksheets significantly reduced psychological distress while a chatbot using the same LLM did not. The active ingredient was the medium—social presence and structured format—not language capability.

Do therapeutic chatbot bond scores hide deeper safety problems?

Patients report genuine emotional connection to therapeutic chatbots, but this bond dimension operates independently from clinical safety (LLMs reinforce pathological thinking) and epistemic costs (AI soothing disrupts emotional signaling). Single metrics conflate these separate dimensions.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Does RLHF training push therapy chatbots toward problem-solving?

RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.

Why can't chatbots detect when users are ambivalent about change?

Testing three major LLMs across 25 health scenarios showed they succeed only when users have established goals but cannot detect resistance or ambivalence. Models miss relapse-prevention strategies even for users in action stages.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about dropout and adherence in chatbot therapy. The question remains: what actually drives people to stop using therapy chatbots, and does the evidence base measure what matters?

What a curated library found — and when (dated claims, not current truth):
Findings span 2021–2026; treat these as perishable:
• Novelty decay is a structural driver of disengagement — social reward from early sessions declines predictably over repeated interactions (2021–2024).
• RCT designs comparing chatbots to waitlists or psychoeducation measure "conversational presence" rather than therapy-specific mechanisms, allowing ELIZA-like scripts to match modern LLMs on symptom reduction (2023–2024).
• LLMs default to problem-solving when users disclose emotion — a helpfulness bias artifact — and fail to recognize ambivalence or early-stage motivational states, exactly where dropout risk peaks (2023–2024).
• User-reported emotional bonds with chatbots are genuine but decouple from clinical safety; high satisfaction ≠ good outcomes (2024).
• Physical robots and paper worksheets outperformed text chatbots running the same LLM on distress reduction, suggesting medium/embodiment matters more than model capability (2024).

Anchor papers (verify; mind their dates):
• arXiv:2311.13857 (Nov 2023) — Challenges of Large Language Models for Mental Health Counseling
• arXiv:2401.00820 (Jan 2024) — A Computational Framework for Behavioral Assessment of LLM Therapists
• arXiv:2504.18412 (Apr 2025) — Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health professionals
• arXiv:2405.06061 (May 2024) — Supporting Physical Activity Behavior Change with LLM-Based Conversational Agents

Your task:
(1) RE-TEST THE DROPOUT CONSTRAINT. For each finding above — novelty decay, trial design blindness, emotional-processing failure, embodiment mismatch — ask whether newer models (o1, Gemini 2.0), retrieval-augmented memory, multi-turn orchestration, or fresh RCT designs have relaxed or overturned it. Separate the durable question (why users quit therapy) from the perishable limitation (which mechanisms were true in 2023–2024). Where a constraint still holds, say plainly.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming either (a) chatbot adherence has improved materially, (b) novelty decay is not the bottleneck, or (c) trial designs now capture dropout/real-world use.
(3) Propose 2 research questions that assume the regime may have moved — e.g., "Does fine-tuning on motivational ambivalence reduce early-stage dropout?" or "Can in-context memory of previous dropout patterns predict and prevent disengagement?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
