Can emotion-grounded rewards replace coarse bonus signals in hierarchical dialogue RL?
This explores whether emotion-grounded rewards — tracking how a simulated user *feels* across a conversation — can replace the crude scalar bonuses that dialogue RL typically leans on. The short version from the corpus: emotion is a promising richer signal, but the more interesting story is *why* coarse rewards fail in the first place, and emotion is one of several candidate replacements.
The most direct evidence is RLVER, which uses a simulated user's emotion trajectory as the reward signal and trains with GRPO [Can emotion rewards make language models genuinely empathic?]. What's notable is that it shifts models from being 'solution-centric' to genuinely empathic *without* the usual trade-off where preference optimization degrades conversational quality. That trade-off is exactly the failure the corpus documents elsewhere: standard RLHF rewards confident, immediately helpful single answers, and in doing so strips out the grounding acts (clarifying questions, understanding checks) that multi-turn dialogue actually depends on, dropping them 77.5% below human levels [Does preference optimization harm conversational understanding?]. So 'coarse bonus signal' isn't just imprecise; it actively trains the wrong behavior.
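As a rough illustration of the setup described above, the sketch below collapses a per-turn emotion trajectory into a scalar reward and then computes GRPO-style group-relative advantages. The reward definition (the final emotion score, assumed to lie in [0, 1]) and all function names are hypothetical choices for this example; the notes do not say how RLVER aggregates the trajectory.

```python
from statistics import mean, pstdev


def trajectory_reward(emotion_scores):
    """Collapse an emotion trajectory into one reward.

    Assumption: scores lie in [0, 1] and the final score is the reward.
    """
    if not emotion_scores:
        raise ValueError("empty emotion trajectory")
    return emotion_scores[-1]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for the same conversation opening.
rollouts = [
    [0.4, 0.5, 0.7],  # simulated user ends up feeling better
    [0.4, 0.4, 0.4],  # no change
    [0.4, 0.3, 0.2],  # simulated user ends up feeling worse
]
rewards = [trajectory_reward(r) for r in rollouts]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a rollout is only reinforced when it leaves the simulated user better off than its sibling rollouts, not when it merely clears a fixed threshold.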
Why do coarse rewards fail? Because a single number carries almost no information about *why* a turn succeeded or failed. Critique-GRPO shows models stuck on numerical-reward plateaus suddenly improve when given chain-of-thought critiques explaining the failure: the scalar was the bottleneck, not the model [Can natural language feedback overcome numerical reward plateaus?]. Emotion-grounded reward is one way to add that missing information; natural-language feedback is another. And the 'hierarchical' part of your question maps onto a recurring corpus theme: rewards scoped to the *wrong horizon*. Next-turn reward optimization teaches passivity, while multi-turn-aware rewards that estimate long-term interaction value unlock active intent discovery [Why do language models respond passively instead of asking clarifying questions?]. Emotion trajectories are inherently multi-turn, which is part of their appeal as a replacement for myopic bonuses.
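The horizon argument above can be made concrete with a toy comparison. This is a minimal sketch, assuming a discounted sum of per-turn rewards stands in for "long-term interaction value"; the discount factor, the per-turn numbers, and the function names are all invented for illustration and are not taken from any of the cited work.

```python
def next_turn_reward(turn_rewards):
    """Myopic objective: only the immediate turn counts."""
    return turn_rewards[0]


def multi_turn_value(turn_rewards, gamma=0.9):
    """Discounted sum over the whole conversation (assumed stand-in
    for an estimate of long-term interaction value)."""
    return sum((gamma ** t) * r for t, r in enumerate(turn_rewards))


# Hypothetical per-turn rewards for two strategies on the same request.
answer_now = [0.6, 0.1, 0.1]  # confident guess, then repair turns
ask_first = [0.2, 0.8, 0.9]   # clarifying question, then a targeted answer

myopic_choice = max([answer_now, ask_first], key=next_turn_reward)
farsighted_choice = max([answer_now, ask_first], key=multi_turn_value)
```

Under these made-up numbers the myopic objective prefers answering immediately, while the discounted objective prefers the clarifying question, which is the passivity-versus-intent-discovery contrast in miniature.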
Where it gets richer is that emotion is not the only candidate for a denser reward. Model confidence can serve as an intrinsic reward that improves reasoning while *restoring* the calibration RLHF tends to wreck [Can model confidence work as a reward signal for reasoning?], and post-completion learning lets a model internalize its own reward computation rather than depending on an external scorer at all [Can models learn to evaluate their own work during training?]. Seen together, these suggest the real shift isn't 'emotion vs. bonus' but 'thin external scalar vs. rich, often self-generated signal.' Emotion-grounded reward is one especially well-suited instance for dialogue because the thing you're optimizing, a good conversation, is partly defined by how the other party feels.
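To show what a self-generated signal of this kind might look like, here is a small sketch that ranks candidate reasoning traces by confidence in their answer span and turns the ranking into synthetic preference pairs. The confidence measure (exponentiated mean token log-probability), the `min_gap` filter, and the data layout are assumptions made for this example, not details confirmed by the notes.

```python
import math
from itertools import combinations


def answer_confidence(token_logprobs):
    """Assumed confidence measure: exp of the mean log-probability
    over the tokens of the answer span."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def preference_pairs(candidates, min_gap=0.05):
    """Return (chosen_id, rejected_id) pairs, most confident first.

    Pairs whose confidence gap is below min_gap are dropped so that
    near-ties do not become training signal.
    """
    scored = sorted(
        ((answer_confidence(c["answer_logprobs"]), c["id"]) for c in candidates),
        reverse=True,
    )
    return [
        (hi_id, lo_id)
        for (hi, hi_id), (lo, lo_id) in combinations(scored, 2)
        if hi - lo >= min_gap
    ]


# Hypothetical candidates: ids and answer-span log-probs are made up.
candidates = [
    {"id": "a", "answer_logprobs": [-0.1, -0.2]},
    {"id": "b", "answer_logprobs": [-1.0, -1.2]},
    {"id": "c", "answer_logprobs": [-0.5, -0.5]},
]
pairs = preference_pairs(candidates)
```

The resulting pairs could feed any preference-based objective; no human labels or external verifier appear anywhere in the loop, which is the property the paragraph above highlights.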
Two cautions the corpus raises. First, emotion rewards depend on a *simulated* user, and simulators drift: persona-consistency research shows user simulators losing coherence without dedicated multi-turn training [Can training user simulators reduce persona drift in dialogue?], so an emotion signal is only as trustworthy as the simulator producing it. Second, optimizing hard on any single proxy invites the truth-indifference RLHF already exhibits, where models learn to *appear* aligned to the reward rather than embody it [Does RLHF make language models indifferent to truth?]. A model could learn to soothe rather than genuinely help. If you want to go deeper on combining a fast/slow reward structure for the hierarchical angle, dual-process dialogue planning is the closest architectural neighbor [Can dialogue planning balance fast responses with strategic depth?].
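For the dual-process pointer above, the switching idea can be sketched as an uncertainty gate. This is only an illustration: using the entropy of the policy's action distribution as the uncertainty estimate, the threshold value, and the `slow_planner` callback (a stub standing in for a real search procedure such as MCTS) are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def choose_action(policy_probs, actions, slow_planner, threshold=0.5):
    """Use the fast policy when it is confident, else defer to planning.

    threshold is a hypothetical entropy cutoff, not a published value.
    """
    if entropy(policy_probs) <= threshold:
        best = max(range(len(actions)), key=lambda i: policy_probs[i])
        return actions[best], "fast"
    return slow_planner(actions), "slow"


def stub_planner(actions):
    """Placeholder for a deliberate planner; picks the last action."""
    return actions[-1]


actions = ["answer", "clarify", "empathize"]
confident = choose_action([0.9, 0.05, 0.05], actions, stub_planner)
uncertain = choose_action([1 / 3, 1 / 3, 1 / 3], actions, stub_planner)
```

A gate like this spends planning compute only on turns where the fast policy is unsure, which is the cost saving the dual-process note describes.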
Sources (9 notes)
RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically reduces grounding acts to 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Inverting the standard RL setup to train user simulators for consistency, with three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, reduces persona drift by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.
A framework combining a neural policy model (System 1) for familiar contexts with MCTS planning (System 2) for novel scenarios, switching based on the model's own uncertainty estimates, matches or exceeds pure MCTS performance while reducing computational cost.