Does RLHF training push therapy chatbots toward problem-solving?

Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.

Synthesis note · 2026-02-22 · sourced from Psychology Chatbots Conversation

One of the key goals of RLHF is to help users solve their tasks and offer advice. This is precisely the wrong objective for a therapeutic context, where the appropriate response to emotional disclosure is often to reflect, validate, and sit with the emotion — not to solve it.

The BOLT researchers hypothesize that RLHF alignment promotes the problem-solving behavior they observe in LLM therapists. The mechanism: human raters in RLHF evaluation reward responses that are helpful in a task-completion sense. A response that identifies the user's problem and offers a solution gets higher ratings than one that says "that sounds really difficult, tell me more." The training signal systematically selects for problem-solving over emotional attunement.
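
To make that selection pressure concrete, here is a minimal sketch. It assumes the reward model is fit with a Bradley-Terry pairwise objective, which is a common choice in RLHF pipelines but is not something the BOLT hypothesis itself specifies. The 80% win rate is a made-up illustrative number, not a measured result, and `preference_prob` and `reward_gap` are hypothetical helper names.

```python
import math


def preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry: probability that a rater prefers response A over B."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))


def reward_gap(win_rate: float) -> float:
    """Reward difference r_a - r_b implied by an observed win rate (the logit)."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return math.log(win_rate / (1.0 - win_rate))


# ASSUMPTION (illustrative only): raters pick the solution-offering reply
# over the validating reply in 80% of pairwise comparisons.
gap = reward_gap(0.8)

# A reward model that fits those comparisons must score the solution-offering
# reply higher by `gap`, and the policy is then optimized toward that score.
print(f"implied reward gap: {gap:.3f}")
print(f"recovered preference probability: {preference_prob(gap, 0.0):.2f}")
```

Under this assumption, any consistent rater preference for solutions becomes a positive reward gap, and policy optimization pushes probability mass toward the higher-reward style regardless of whether that style suits the clinical moment.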

This is the alignment tax operating in a specific clinical domain. The related notes "Does preference optimization damage conversational grounding in large language models?" and "Does preference optimization harm conversational understanding?" describe the general mechanism. What BOLT adds is domain-specific evidence: the same mechanism that erodes general grounding also erodes therapeutic quality, by rewarding task completion when the clinical need is emotional holding.

The irony is sharp: alignment training — designed to make models safe and helpful — may make them clinically harmful in therapeutic contexts by turning every emotional expression into a problem to be solved.

This connects to "Can emotion rewards make language models genuinely empathic?" (RLVER), which shows that alternative reward functions can produce different behavior. The problem is not RL per se but what gets rewarded. Task-completion rewards produce task-completion behavior, even when the task is emotional care.
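
The point that the reward, not the optimizer, determines the behavior can be shown with a toy sketch. Everything here is an assumption made for illustration: the keyword lists, the two candidate replies, and the function names are hypothetical, real reward models are learned rather than keyword-based, and best-of-n selection stands in for RL only because it applies the same selection signal. The attunement reward below is not RLVER's reward function.

```python
from typing import Callable, Sequence

# ASSUMPTION: hypothetical cue lists standing in for learned reward models.
SOLUTION_CUES = ("you should", "try", "step", "plan")
VALIDATION_CUES = ("that sounds", "makes sense", "tell me more", "i hear")


def count_cues(text: str, cues: Sequence[str]) -> int:
    """Count occurrences of each cue phrase in the lowercased text."""
    lowered = text.lower()
    return sum(lowered.count(cue) for cue in cues)


def task_completion_reward(text: str) -> float:
    """Toy reward that pays for advice-giving language."""
    return float(count_cues(text, SOLUTION_CUES))


def emotional_attunement_reward(text: str) -> float:
    """Toy reward that pays for reflective, validating language."""
    return float(count_cues(text, VALIDATION_CUES))


def best_of_n(candidates: Sequence[str], reward: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward (same selection rule for both)."""
    return max(candidates, key=reward)


# Hypothetical replies to an emotional disclosure.
CANDIDATES = [
    "You should make a plan: step one, try going to bed earlier.",
    "That sounds really difficult. Tell me more about what it has been like.",
]

print(best_of_n(CANDIDATES, task_completion_reward))
print(best_of_n(CANDIDATES, emotional_attunement_reward))
```

The selection procedure is identical in both calls; only the reward function changes, and the chosen reply flips from advice to validation.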

Inquiring lines that read this note 85

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can AI systems balance emotional competence with factual reliability?

What constrains reinforcement learning's ability to expand model reasoning?

How do chatbots affect human self-disclosure and emotional engagement?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Why do LLM chatbots fail as independent therapeutic agents?

How can real-time alliance measurement improve therapy outcomes?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Do disorder-specific RL policies outperform single policies across anxiety, depression, and schizophrenia?

What properties determine whether reward signals teach genuine reasoning?

How can emotions function as reliable information in reasoning and cognitive systems?

Why do positive emotional words contribute disproportionately to prompt enhancement effects?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

Can synchrony metrics automatically evaluate the quality of therapeutic AI conversations?

What makes AI persuasion effective and how can we counter it?

How does motivational stage determine which interventions actually work for users?

What determines success in training models on multiple tasks?

How does task decomposition prevent bias from spreading across therapeutic AI pipelines?

How do policy learning algorithm choices affect multi-objective optimization stability?

Why does GRPO outperform PPO for stable empathy training?

Can LLM personas constitute genuine psychology or remain linguistic role-play?

Does RLHF training create realized quasi-psychologies or just stickier pretense?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

Does policy entropy collapse explain why excessive challenge destabilizes empathy training?

Related concepts in this collection 6

Does preference optimization damage conversational grounding in large language models? Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
general mechanism; BOLT is the clinical domain instantiation
Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
writing angle that BOLT directly supports
Can emotion rewards make language models genuinely empathic? Explores whether grounding RL rewards in verifiable emotion change—rather than human preference—can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.
counter-evidence: different rewards produce different behavior
Why can't conversational AI agents take the initiative? Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
passivity compounds the problem-solving bias: a passive model that only responds to what's presented AND defaults to task completion is doubly misaligned for therapeutic contexts that require proactive emotional attunement
Why can't advanced AI models take initiative in conversation? Despite extraordinary capability in answering and reasoning, LLMs fundamentally cannot initiate, redirect, or guide exchanges. Understanding this gap—and whether it's fixable—matters for building AI that truly collaborates rather than merely responds.
the RLHF problem-solving bias is a domain-specific instance of the passivity problem's core tension: we train models to be maximally helpful in each response (→ solve problems) which makes them maximally passive across the conversation (→ never take therapeutic initiative)
Can LLMs actually conduct Socratic questioning in therapy? While LLMs can generate individual therapy skills like assessment and psychoeducation, it remains unclear whether they can execute the adaptive, turn-based Socratic questioning needed to produce real cognitive change in patients.
RLHF compounds the therapy skill gap: even if multi-turn Socratic questioning were achievable, helpfulness training would bias the model away from the exploratory questioning that makes it therapeutic

Original note title

rlhf alignment may drive therapeutic chatbots toward problem-solving over emotional attunement because helpfulness training rewards task completion
