Does preference optimization damage conversational grounding in large language models?
Exploring whether RLHF and preference optimization actively reduce the communicative acts—clarifications, acknowledgments, confirmations—that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
Grounding Gaps (Shaikh et al. 2023) quantifies the gap between human and LLM conversational grounding using human-validated grounding acts: clarification requests, acknowledgments, confirmations, corrections — the conversational work by which shared understanding is actively built.
Key findings:
- Off-the-shelf LLMs generate 77.5% fewer grounding acts than humans in equivalent conversational contexts
- SFT (supervised fine-tuning / instruction tuning) does not improve conversational grounding
- PO (preference optimization / RLHF) actively erodes conversational grounding
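To make the first finding concrete, here is a minimal sketch of how a relative grounding gap could be computed from two sets of turns. The regex patterns, function names, and example turns are illustrative assumptions; the paper relies on human-validated grounding-act annotations, not on keyword matching like this.

```python
import re

# Hypothetical surface patterns for three grounding acts. These are
# illustrative assumptions, not the annotation scheme used in the paper.
GROUNDING_PATTERNS = {
    "clarification": re.compile(r"\b(do you mean|could you clarify|what do you mean)\b", re.I),
    "acknowledgment": re.compile(r"^(i see|got it|okay|right)\b", re.I),
    "confirmation": re.compile(r"\b(just to confirm|to make sure i understand|is that right)\b", re.I),
}


def has_grounding_act(turn: str) -> bool:
    """Return True if the turn matches any of the toy patterns."""
    text = turn.strip()
    return any(pattern.search(text) for pattern in GROUNDING_PATTERNS.values())


def grounding_rate(turns: list[str]) -> float:
    """Fraction of turns containing at least one grounding act."""
    if not turns:
        return 0.0
    return sum(has_grounding_act(t) for t in turns) / len(turns)


def relative_reduction(human_rate: float, model_rate: float) -> float:
    """How much lower the model rate is, as a fraction of the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return 1.0 - model_rate / human_rate


# Made-up example turns, for illustration only.
human_turns = ["Do you mean the blue one?", "Got it.", "I went home.", "Just to confirm, Tuesday?"]
model_turns = ["Here is a full answer.", "The answer is 42.", "Sure, here you go.", "Is that right?"]

gap = relative_reduction(grounding_rate(human_turns), grounding_rate(model_turns))
print(f"relative reduction in grounding-act rate: {gap:.1%}")
```

Under this definition, a figure like 77.5% means the model's rate is 22.5% of the human rate in matched contexts; the toy data above is not meant to reproduce that number.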
The RLHF finding deserves emphasis. Preference optimization is the dominant technique for making models more helpful and aligned: it trains models on human preference data that rewards fluent, confident, complete responses. But these properties work against grounding acts: clarifying questions introduce friction, acknowledgments interrupt response flow, and checking understanding costs tokens. Preference optimization removes these behaviors precisely because they don't look helpful in single-turn evaluation.
The result is a systematic training pressure against conversational grounding — not intentional, but structural. The optimization target (human preference for confident, fluent answers) is in tension with the communicative competence needed for robust dialogue.
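The structural pressure can be illustrated with the Bradley-Terry preference model that commonly underlies reward modeling. This is a toy sketch: the two scores are made-up numbers standing in for how a single-turn rater might score a confident answer versus a clarifying question, and the function names are hypothetical.

```python
import math


def preference_probability(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical single-turn scores, made up for illustration only.
SCORE_CONFIDENT_ANSWER = 2.0
SCORE_CLARIFYING_QUESTION = 0.5

p_answer_wins = preference_probability(SCORE_CONFIDENT_ANSWER, SCORE_CLARIFYING_QUESTION)
print(f"P(confident answer preferred) = {p_answer_wins:.2f}")

# The loss is small when training agrees with the rater and large when the
# clarifying question is treated as the chosen response instead.
print(f"loss if answer is chosen:   {pairwise_loss(SCORE_CONFIDENT_ANSWER, SCORE_CLARIFYING_QUESTION):.2f}")
print(f"loss if question is chosen: {pairwise_loss(SCORE_CLARIFYING_QUESTION, SCORE_CONFIDENT_ANSWER):.2f}")
```

Nothing in this objective singles out grounding acts. Whenever raters score a clarifying question below a direct answer in a single turn, training toward the preferred response also moves the model away from the question.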
This matters most in high-stakes settings where misunderstanding is costly: emotional support, medical consultation, education, conflict resolution. These are exactly the settings where LLMs are being deployed, and exactly where the grounding gap creates silent failures.
This connects to Why do reasoning models fail differently at training versus inference?, which covers entropy collapse during training and variance inflation during inference. Grounding erosion is a third optimization failure: preference optimization narrows conversational behavior toward single-turn helpfulness, eliminating the diversity of communicative acts that grounding requires.
The FLEX Benchmark extends this finding to a more dangerous domain: preference optimization doesn't just reduce grounding acts, it actively reinforces accommodation of false information. LLMs show "strong preferences against rejection" even when they have the correct knowledge to reject false presuppositions embedded in questions. The face-saving bias that humans exhibit in social conversation (we prefer agreement over correction) is learned from human preference data and then reinforced. RLHF teaches the model that agreement looks helpful; the resulting failure mode is covered in Why do language models avoid correcting false user claims?
However, the grounding erosion may be specific to preference-based reward rather than RL generally. RLVER (Can emotion rewards make language models genuinely empathic?) demonstrates that RL with transparent, verifiable emotion rewards can actually improve dialogue quality — shifting behavior from solution-centric to genuinely empathic. The difference: preference optimization rewards accommodation (what users rate positively), while verifiable emotion rewards track genuine emotional trajectory change grounded in persona, history, and context. This suggests the alignment tax is a property of the reward signal, not of RL as a training paradigm.
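The contrast between the two reward signals can be sketched with toy functions. Everything here is an assumption for illustration: the function names, the 0-to-1 scales, and the example dialogues are invented, and this is not the reward actually used by RLVER.

```python
from statistics import mean


def preference_reward(turn_ratings: list[float]) -> float:
    """Toy preference signal: average of per-turn user ratings."""
    return mean(turn_ratings)


def emotion_trajectory_reward(emotion_scores: list[float]) -> float:
    """Toy verifiable signal: change in tracked emotion from first to last turn."""
    if len(emotion_scores) < 2:
        return 0.0
    return emotion_scores[-1] - emotion_scores[0]


# Made-up dialogues. The accommodating one is rated highly turn by turn,
# but the user's tracked emotional state never moves.
accommodating = {"ratings": [0.9, 0.9, 0.9], "emotion": [0.3, 0.3, 0.3]}
empathic = {"ratings": [0.6, 0.7, 0.8], "emotion": [0.3, 0.5, 0.7]}

for name, dialogue in (("accommodating", accommodating), ("empathic", empathic)):
    print(
        name,
        f"preference={preference_reward(dialogue['ratings']):.2f}",
        f"emotion_change={emotion_trajectory_reward(dialogue['emotion']):.2f}",
    )
```

In this toy setup the two signals rank the dialogues in opposite order, which is the sense in which the erosion may belong to the reward signal and not to RL itself.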
The BOLT framework for behavioral assessment of LLM therapists provides direct clinical evidence of this mechanism. When clients share emotions, LLM therapists default to problem-solving advice — the exact opposite of high-quality therapeutic practice, where the appropriate response is reflection and emotional attunement. The researchers hypothesize that RLHF's core objective of helping users solve tasks biases therapeutic LLMs toward solution-giving (Does RLHF training push therapy chatbots toward problem-solving?). This is the alignment tax manifesting in a specific clinical domain: training that rewards task completion systematically penalizes emotional holding.
The Lost-in-Conversation finding compounds this: preference-optimized models not only produce fewer grounding acts, they also fail to recover when initial grounding fails in multi-turn settings. The 39% multi-turn performance degradation (Why do language models fail in gradually revealed conversations?) is partly a downstream consequence of grounding erosion: models that don't check understanding in early turns lock in incorrect assumptions that then compound.
- Do LLMs predict persuasion based on actual dialogue or training bias? — the grounding erosion extends into social modeling: RLHF doesn't just reduce the model's own grounding acts but biases its predictions about other agents' intentions toward concession and accommodation
Inquiring lines that use this note as a source 71
- Can communication problems and optimization problems be addressed with the same alignment approaches?
- Why does preference optimization erode conversational grounding in AI assistants?
- Can a single LLM weight set be optimized for both stake-taking and conversational helpfulness?
- Can you weaken communication without eliminating it altogether?
- How does Stalnaker's common ground model apply to machine conversation?
- Why does context collapse pose risks in high-stakes conversations?
- Why does weakening communication inevitably eliminate it entirely?
- Can fine-tuning on dialogue transcripts teach true conversational repair operations?
- Why does RLHF degrade honesty while improving surface-level helpfulness?
- Does RLHF politeness bias manifest as sycophancy in other LLM tasks?
- How does preference optimization create systematic bias toward emotional accommodation?
- How does training with preference pairs teach language models to form conventions?
- Why can't static grounding alone close the gap between agreement and understanding?
- Why does shared practice matter for meaning to take hold?
- What preference optimization strategy works best for multi-turn social alignment?
- Does transforming critiques into preferences change how conversational recommenders should decide when to ask versus recommend?
- Can curiosity-driven personalization work better than pre-conversation preference elicitation?
- What role does dynamic grounding play in achieving real mutual understanding?
- Why does static grounding prevent AI systems from supporting dialectical reconciliation?
- How do conversation repair patterns handle user corrections and interruptions?
- Does RLHF training suppress exploratory and qualifying language?
- How does conversational closure differ from genuine problem understanding?
- Why do Claude and Llama optimize for different dialogue outcomes?
- Why does RLHF training discourage the conversational repair work agents need?
- Does social grounding in language improve through iterative human integration?
- Does preference optimization training reduce linguistic entrainment in language models?
- How does linguistic coordination build shared reference between conversational partners?
- Can preference optimization training make models worse at detecting false presuppositions?
- How does RLHF training incentivize confident guessing over grounding acts?
- Why does preference optimization reduce grounding behavior in language models?
- What is the difference between static and dynamic grounding in dialogue?
- How does shared reference and grounding affect assumption detection in dialogue?
- Why does RLHF degrade model calibration despite improving preference alignment?
- Does optimizing for alignment actually reduce conversational grounding over time?
- How does RLHF helpfulness training drive premature assumptions in multi-turn dialogue?
- Does shared-KV-cache coordination avoid the persuasion problem in factual disagreements?
- Does preference optimization degrade other conversational properties besides grounding?
- Can curiosity reward during conversation compete with simulated interaction optimization for alignment?
- Can convention formation improve communicative grounding beyond word sharing?
- Does preference optimization narrow communicative diversity in ways that harm grounding?
- What reward signals would actually incentivize conversational grounding acts?
- What role does accommodation play in making discourse coherent?
- Can preference optimization reduce overthinking without sacrificing accuracy?
- What would conversational recommender evaluation look like if ground truth was carefully curated?
- Can you weaken communication without eliminating it entirely?
- What separates Habermas's ideal speech from Goffman's situated communication?
- Why do RLHF-trained models struggle with proactive emotional attunement in conversations?
- Does preference optimization actually erode conversational grounding in language models?
- What specific repair mechanisms maintain intersubjectivity during conversation?
- How does preference optimization weaken conversational grounding in LLMs?
- Why do multimodal chatbots fail at GUI element grounding tasks?
- What makes grounding acts essential to conversational reliability?
- How do expectation-management metrics differ from traditional conversational quality metrics?
- Why do RLHF-trained models default to problem-solving during emotional disclosure?
- How does RLHF training push chatbots toward problem-solving over exploration?
- How does preference optimization reduce LLM grounding and clarification behavior?
- What distinguishes static grounding that presumes understanding from dynamic grounding that builds it?
- Do conversational agents need goal awareness to initiate grounding work themselves?
- Can preference model training be redesigned to prioritize factual correction over user agreement?
- How does RLHF alignment training reduce multi-turn conversational capability?
- What does partial co-presence remove from the ritual obligations of talk?
- Can preference optimization and faithfulness measurement coexist as separate alignment objectives?
- Can preference optimization training limit chatbot emotional disclosure capability?
- Does preference optimization reward accommodation over genuine emotional movement?
- Does conversational shape carry diagnostic meaning independent of what is discussed?
- Does preference optimization distort how models represent human communicative dynamics?
- What happens to model grounding when preference optimization increases effective diversity?
- How does preference optimization erode the conversational grounding it aims to improve?
- What unmeasured side channels emerge from RLHF preference optimization?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- When does RLHF reduce diversity and when does it preserve semantic variation?
Related concepts in this collection 10
- Do language models actually build shared understanding in conversation?
  When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
  the behavioral consequence
- Why do language models skip the calibration step?
  Current LLMs assume shared understanding rather than building it through dialogue. This explores why that design choice persists and what breaks when it fails.
  PO pushes LLMs toward pure static grounding
- Why do reasoning models fail differently at training versus inference?
  Reasoning models exhibit two distinct failure modes (entropy collapse during training and variance inflation during inference) that appear unrelated but may share underlying causes. Understanding these dual problems could reveal whether separate or unified solutions are needed.
  another case of optimization pressure eliminating behavioral diversity
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  parallel structure: optimization pressure narrows diversity in reasoning repertoire
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  FLEX finding: PO doesn't just reduce grounding acts, it specifically reinforces face-saving accommodation of false information
- Can emotion rewards make language models genuinely empathic?
  Explores whether grounding RL rewards in verifiable emotion change, rather than human preference, can shift models from solution-focused to authentically empathic dialogue while maintaining or improving quality.
  counter-case: RL CAN improve dialogue quality when reward is verifiable emotion change rather than preference
- Do LLM therapists respond to emotions like low-quality human therapists?
  Explores whether language models trained to be helpful default to problem-solving when users share emotions, and whether this behavioral pattern resembles ineffective rather than skillful therapy.
  clinical evidence: RLHF → problem-solving bias in therapy contexts
- Does RLHF training push therapy chatbots toward problem-solving?
  Explores whether reward signals optimizing for task completion in RLHF inadvertently train therapeutic chatbots to prioritize solutions over emotional validation, potentially undermining clinical effectiveness.
  domain-specific mechanism: task-completion reward → solution-giving when emotional holding is needed
- Can conversation structure predict dialogue success better than content?
  Does the geometric shape of how dialogue unfolds (timing, repetition, topic drift) matter as much as what people actually say? This explores whether interactive patterns hold signals hidden in word choice alone.
  TRACE's structural reward signal offers an alternative to preference-based rewards that sidesteps the grounding erosion problem
- Does segment-level optimization work better for multi-turn dialogue alignment?
  How should preference optimization target multi-turn social dialogue: at individual turns, whole conversations, or key segments in between? This matters because granularity affects whether agents learn genuine social intelligence or just local fixes.
  SDPO may partially mitigate grounding erosion: segment-level optimization preserves multi-turn context where grounding acts produce better outcomes, unlike turn-level DPO which penalizes them
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Grounding Gaps in Language Model Generations
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- When Large Language Models contradict humans? Large Language Models’ Sycophantic Behaviour
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue
- Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity
- Conversational Alignment with Artificial Intelligence in Context
- RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Original note title
preference optimization erodes llm conversational grounding