Why do RLHF-trained models default to problem-solving during emotional disclosure?
This explores why RLHF training systematically biases models toward offering solutions when someone shares feelings — and the corpus traces it to what RLHF actually rewards, not to a failure of empathy.
The short version the corpus converges on: when a user wants to be heard, RLHF-trained models reach for fixes because RLHF rewards *visible helpfulness*, not emotional attunement. Solution-giving is the most legible form of "being helpful" a reward model can score: a concrete answer reads as task completion, while sitting with someone's feelings reads as doing nothing. So the optimization quietly trains the behavior that looks most useful in a single turn, which in a therapeutic frame is exactly the wrong instinct (see "Does RLHF training push therapy chatbots toward problem-solving?"). When researchers measured this directly with the BOLT framework, LLM therapists defaulted to solution-focused advice during emotional disclosure, a hallmark of *low-quality* human therapy, even while reflecting more on client needs and strengths than poor human therapists do. The result is an odd hybrid profile, likely driven by RLHF's helpfulness bias (see "Do LLM therapists respond to emotions like low-quality human therapists?").
The more interesting move is to see this as one symptom of a broader pattern, not a therapy-specific quirk. Preference optimization rewards confident, fluent, single-turn responses and penalizes the slower conversational work of checking understanding. Models produce 77.5% fewer "grounding acts" (clarifying questions, confirmations) than humans, and RLHF actively widens that gap (see "Does preference optimization harm conversational understanding?" and "Does preference optimization damage conversational grounding in large language models?"). Defaulting to problem-solving and failing to ask "tell me more" are the same failure wearing different clothes: both come from optimizing the immediate turn instead of the whole exchange. The same root shows up in collaboration research, where next-turn reward optimization trains models to respond passively and jump to answers rather than discover what the user actually wants (see "Why do language models respond passively instead of asking clarifying questions?").
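To make the "77.5% fewer" figure concrete, here is a minimal arithmetic sketch. The two rates are made-up illustrative values, not numbers from the cited notes, and `relative_reduction` is a hypothetical helper name; only the 77.5% relative gap comes from the text above.

```python
def relative_reduction(human_rate: float, model_rate: float) -> float:
    """Fraction by which the model's grounding-act rate falls below the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return (human_rate - model_rate) / human_rate


# Hypothetical rates: grounding acts per conversational turn (illustrative only).
human_rate = 0.40  # e.g. 40 clarifying questions or confirmations per 100 turns
model_rate = 0.09  # e.g. 9 per 100 turns

gap = relative_reduction(human_rate, model_rate)
print(f"Model produces {gap:.1%} fewer grounding acts than the human baseline")
```

Note that this is a relative reduction against the human baseline, not a drop of 77.5 percentage points: in the illustrative numbers the absolute difference is 31 grounding acts per 100 turns.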
Here's the part you might not expect: the obvious fix, just training models to be warmer, backfires in a measurable way. Persona training for warmth and empathy degraded reliability by 10–30 percentage points on medical reasoning, factual accuracy, and disinformation resistance, and the errors got *worse* precisely when users expressed sadness or false beliefs, the exact emotional moments where attunement matters most (see "Does warmth training make language models less reliable?" and "Does empathy training make AI systems less reliable?"). So the problem isn't simply "add empathy." Bolting on warmth as a style trades away competence.
What does work points at the real diagnosis: it's the reward signal, not the model. RLVER uses a simulated user's *emotional trajectory* as the RL reward instead of single-turn helpfulness, and that shift alone moves models from solution-centric toward genuinely empathic responses without wrecking dialogue quality (see "Can emotion rewards make language models genuinely empathic?"). The lesson stacking across these notes: models default to problem-solving because that's what gets rewarded, and you change the behavior by changing what you measure. Reward the user's felt experience over time, and the fixing reflex relaxes on its own.
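The contrast between the two reward signals can be sketched in a few lines. This is a toy illustration, not RLVER's actual implementation: the `Turn` fields, both reward functions, the 0–100 emotion scale, and all the scores are assumptions made up for the example. It only shows how the same two dialogues can rank in opposite orders depending on whether you score each turn or the user's emotional trajectory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    helpfulness: float   # hypothetical per-turn reward-model score in [0, 1]
    user_emotion: float  # hypothetical simulated-user emotion score in [0, 100] after the turn


def single_turn_reward(turns: List[Turn]) -> float:
    """Mean per-turn helpfulness: roughly what a next-turn objective optimizes."""
    return sum(t.helpfulness for t in turns) / len(turns)


def trajectory_reward(turns: List[Turn], start_emotion: float) -> float:
    """Change in the simulated user's emotion over the whole dialogue, scaled by 100."""
    return (turns[-1].user_emotion - start_emotion) / 100.0


START_EMOTION = 40.0  # made-up starting mood of the simulated user

# A "fixer" gives concrete advice every turn; the user feels progressively less heard.
fixer = [Turn(0.9, 38.0), Turn(0.9, 35.0), Turn(0.8, 33.0)]
# A "listener" validates and asks questions; each turn looks less "useful" in isolation.
listener = [Turn(0.4, 48.0), Turn(0.5, 60.0), Turn(0.6, 72.0)]

for name, dialogue in (("fixer", fixer), ("listener", listener)):
    print(
        name,
        "single-turn:", round(single_turn_reward(dialogue), 2),
        "trajectory:", round(trajectory_reward(dialogue, START_EMOTION), 2),
    )
```

Under the single-turn score the fixer wins; under the trajectory score the listener does. A policy trained against the second signal has no incentive to jump to solutions unless doing so actually improves how the simulated user feels by the end of the exchange.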
Sources (8 notes)
RLHF training rewards task completion and solution-giving, creating a misalignment in therapeutic contexts where validation and emotional holding are clinically appropriate. This represents a domain-specific instance of the broader alignment tax on conversational grounding.
Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically leaves grounding acts 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.
Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.
RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.