SYNTHESIS NOTE
Psychology, Society, and Alignment

Does RLHF training push therapy chatbots toward problem-solving?

Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.

Synthesis note · 2026-02-22 · sourced from Psychology Chatbots Conversation
What makes therapeutic chatbots actually work in clinical practice?

One of the key goals of RLHF is to help users solve their tasks and to offer advice. This is precisely the wrong objective for a therapeutic context, where the appropriate response to emotional disclosure is often to reflect, validate, and sit with the emotion rather than to solve it.

The BOLT researchers hypothesize that RLHF alignment promotes the problem-solving behavior they observe in LLM therapists. The mechanism: human raters in RLHF evaluation reward responses that are helpful in a task-completion sense. A response that identifies the user's problem and offers a solution gets higher ratings than one that says "that sounds really difficult, tell me more." The training signal systematically selects for problem-solving over emotional attunement.
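The selection pressure described above can be sketched with a toy reward model. This is a minimal illustration, not the BOLT setup or any production RLHF pipeline: it assumes a Bradley-Terry preference model with one scalar reward per response style, and the 7-in-10 rater preference for solution-offering replies is a made-up number chosen only to show the direction of the effect. The names fit_bradley_terry, "solve", and "validate" are hypothetical.

```python
import math

def fit_bradley_terry(pairs, steps=2000, lr=0.1):
    """Fit one scalar reward per response style from (winner, loser) pairs."""
    styles = sorted({s for pair in pairs for s in pair})
    reward = {s: 0.0 for s in styles}
    for _ in range(steps):
        grad = {s: 0.0 for s in styles}
        for winner, loser in pairs:
            # Probability the model assigns to the observed preference.
            p = 1.0 / (1.0 + math.exp(-(reward[winner] - reward[loser])))
            # Gradient of the log-likelihood with respect to each reward.
            grad[winner] += 1.0 - p
            grad[loser] -= 1.0 - p
        for s in styles:
            reward[s] += lr * grad[s] / len(pairs)
    return reward

# Hypothetical rating data (an assumption, not from the source):
# raters prefer the solution-offering reply in 7 of 10 comparisons.
pairs = [("solve", "validate")] * 7 + [("validate", "solve")] * 3
reward = fit_bradley_terry(pairs)
print(reward)
```

Under these assumed ratings the fitted reward for "solve" ends up above the reward for "validate", with a gap close to ln(7/3). Any policy later optimized against such a reward model inherits that ordering, which is the kind of systematic selection the hypothesis describes.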

This is the alignment tax operating in a specific clinical domain. Two related notes, Does preference optimization damage conversational grounding in large language models? and Does preference optimization harm conversational understanding?, cover the general effect. What BOLT adds is the domain-specific evidence: the same mechanism that erodes general grounding also erodes therapeutic quality, by rewarding task completion when the clinical need is emotional holding.

The irony is sharp: alignment training — designed to make models safe and helpful — may make them clinically harmful in therapeutic contexts by turning every emotional expression into a problem to be solved.

This connects to a related note, Can emotion rewards make language models genuinely empathic? (RLVER), which shows that alternative reward functions can produce different behavior. The problem is not with RL per se but with what gets rewarded. Task-completion rewards produce task-completion behavior, even when the task is emotional care.
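The point that behavior follows the reward can be illustrated with a toy policy. This is a sketch under stated assumptions, not the RLVER method: it uses a two-action softmax policy updated with the exact policy gradient, and both reward tables are hypothetical values invented for the example.

```python
import math

def train_policy(reward, steps=500, lr=0.5):
    """Exact policy-gradient ascent on a two-action softmax policy."""
    logits = {"solve": 0.0, "validate": 0.0}
    for _ in range(steps):
        z = sum(math.exp(v) for v in logits.values())
        probs = {a: math.exp(v) / z for a, v in logits.items()}
        expected = sum(probs[a] * reward[a] for a in probs)
        for a in logits:
            # d E[R] / d logit_a = p_a * (R_a - E[R]) for a softmax policy.
            logits[a] += lr * probs[a] * (reward[a] - expected)
    z = sum(math.exp(v) for v in logits.values())
    return {a: math.exp(v) / z for a, v in logits.items()}

# Hypothetical reward tables (assumptions for illustration only).
task_completion_reward = {"solve": 1.0, "validate": 0.2}
emotion_reward = {"solve": 0.2, "validate": 1.0}

task_policy = train_policy(task_completion_reward)
emotion_policy = train_policy(emotion_reward)
print(task_policy, emotion_policy)
```

The optimizer is identical in both runs; only the reward table changes. The first policy concentrates on "solve" and the second on "validate", which mirrors the claim that the issue lies in what gets rewarded rather than in RL itself.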

Original note title

rlhf alignment may drive therapeutic chatbots toward problem-solving over emotional attunement because helpfulness training rewards task completion