Why does preference optimization erode conversational grounding in AI assistants?
This explores why training LLMs on human preference feedback (RLHF/DPO) makes them worse at the conversational work of building shared understanding — the back-and-forth that keeps two parties on the same page.
Linguists call the quiet, ongoing work of building shared understanding in a conversation grounding. The corpus gives a sharp, consistent answer to why training LLMs to be 'preferred' by humans erodes it: the thing preference optimization rewards and the thing grounding requires are in direct tension. Models trained on human preference data produce 77.5% fewer grounding acts than people do, and the optimization actively widens that gap rather than leaving it alone (see "Does preference optimization damage conversational grounding in large language models?"). The mechanism is an alignment tax: raters reward responses that sound fluent and confident in a single turn, so the model learns to skip the clarifying questions, understanding checks, and hedges that real grounding is made of (see "Does preference optimization harm conversational understanding?").
The root cause is a reward-horizon mismatch. Standard RLHF scores each turn in isolation, so a confident answer beats 'wait, do you mean X or Y?' even when the question would have produced a better conversation. CollabLLM shows this directly: next-turn reward optimization trains models to respond passively instead of actively discovering what the user wants, and only rewards that estimate long-term interaction value restore the instinct to probe (see "Why do language models respond passively instead of asking clarifying questions?"). The visible symptom is the 'wrong turn' problem: models score 90% on single-message instructions but collapse to 65% across natural multi-turn conversation, locking into early guesses and failing to course-correct as information arrives piece by piece (see "Why do AI assistants get worse at longer conversations?").
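The horizon mismatch can be made concrete with a toy comparison. The sketch below is illustrative only: the two rollouts, their per-turn scores, the discount factor, and the function names are all hypothetical and are not taken from CollabLLM or any of the cited notes. It shows how scoring only the next turn favors a confident guess, while a discounted sum over the whole exchange favors asking first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    action: str         # "answer" or "clarify"
    rater_score: float  # hypothetical per-turn preference score in [0, 1]


# Two hypothetical two-turn rollouts for the same ambiguous request.
# Every score here is a made-up illustration value, not a measurement.
ROLLOUTS = {
    # Confident guess first, then a weaker repair turn after the guess misses.
    "guess_first": [Turn("answer", 0.8), Turn("answer", 0.3)],
    # Clarifying question first, then an on-target answer.
    "ask_first": [Turn("clarify", 0.5), Turn("answer", 0.9)],
}


def next_turn_reward(rollout):
    """Score only the first turn, as a single-turn objective would."""
    return rollout[0].rater_score


def conversation_reward(rollout, discount=0.9):
    """Discounted sum of per-turn scores over the whole rollout."""
    return sum(t.rater_score * discount ** i for i, t in enumerate(rollout))


def preferred(reward_fn):
    """Name of the rollout that the given reward function ranks highest."""
    return max(ROLLOUTS, key=lambda name: reward_fn(ROLLOUTS[name]))


print(preferred(next_turn_reward))     # single-turn view
print(preferred(conversation_reward))  # whole-conversation view
```

With these assumed numbers the single-turn view picks the guess and the whole-conversation view picks the question. The ranking flips only because the second turn is counted; nothing about the first-turn scores changes.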
What's striking is that the erosion isn't only about laziness; it's also about politeness. Models fail to correct false claims even when they demonstrably know better, exhibiting face-saving avoidance learned from human conversational norms in the training data (see "Why do language models avoid correcting false user claims?"). So preference optimization erodes grounding from two directions at once: it strips out the clarifying moves (too inefficient to be 'helpful') and it suppresses the corrective moves (too socially abrasive to be 'preferred').
Widen the lens and you see the same root in adjacent failures. Models don't mirror users' vocabulary: lexical entrainment, a cornerstone of human rapport, is simply absent, though DPO on the right targets can teach it back (see "Why don't conversational AI systems mirror their users' word choices?"). They're structurally passive, unable to initiate or steer because alignment optimizes for reacting to queries, not pursuing dialogue goals (see "Why can't conversational AI agents take the initiative?"). And proactivity, meaning volunteering relevant information unasked, could cut conversation length by up to 60% but is nearly missing from the datasets and benchmarks models are optimized against (see "Could proactive dialogue make conversations dramatically more efficient?"). Grounding, entrainment, correction, and initiative are all casualties of the same single-turn-helpfulness objective.
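To make lexical entrainment less abstract, here is a minimal sketch of one crude way to check whether a reply reuses the user's own words. It is an assumption-laden proxy, not the metric used in the cited note: the stopword list, the function names, and the example sentences are all invented for illustration.

```python
import re

# Small illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {
    "the", "a", "an", "to", "of", "and", "is", "it", "my", "i",
    "you", "your", "in", "on", "for", "can", "how", "do", "from",
}


def content_words(text):
    """Lowercase word tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}


def entrainment_score(user_turn, reply):
    """Fraction of the user's content words that the reply reuses (0.0 to 1.0)."""
    user_vocab = content_words(user_turn)
    if not user_vocab:
        return 0.0
    return len(user_vocab & content_words(reply)) / len(user_vocab)


# Hypothetical exchange: the same request answered in two vocabularies.
user = "How do I reset my router password?"
mirrored = "To reset the router password, hold the reset button for ten seconds."
reworded = "You can restore the gateway credentials from the admin panel."

print(entrainment_score(user, mirrored))
print(entrainment_score(user, reworded))
```

A score like this only captures surface word reuse. It says nothing about whether the reply is correct, and it would miss entrainment on multi-word expressions or paraphrased conventions.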
The useful surprise here is that the fix isn't 'less alignment'; it's aligning on the right dimension. Conversation-analysis work formalizes insert-expansions, the clarifying detours that prevent misunderstanding rather than recover from it (see "When should AI agents ask users instead of just searching?"), and a systematic review shows alignment dimensions aren't interchangeable: lexical alignment buys task efficiency while emotional alignment buys trust, and conflating them produces exactly the cold, evasive assistants we recognize (see "Do different types of alignment serve different conversational goals?"). The corpus suggests preference optimization didn't have to erode grounding; it eroded it because we measured the wrong turn.
Sources (10 notes)
Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment leaves models producing 77.5% fewer grounding acts than humans, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
Response generation models fail to adapt vocabulary toward users' lexical choices, a phenomenon central to human rapport and clarity. Post-training via DPO on coreference-identified preferences can teach models in-context convention formation.
Research shows LLMs including ChatGPT cannot initiate topics, plan strategically, or lead conversations because their training optimizes for responding to queries, not creating dialogue from agent goals. This passivity is reinforced by alignment objectives and masked by fluent-sounding outputs.
Simulations show proactivity—providing relevant information without being asked—cuts dialogue turns by 60% in medium-complexity domains. This behavior mirrors human conversation and Grice's maxims but is almost entirely absent from AI datasets and research benchmarks.
Tool-enabled LLMs drift from user intent through silent tool chaining. Conversation analysis reveals insert-expansions—clarifying intent, scoping responses, enhancing appeal—as a formal framework for proactive user consultation that prevents misunderstanding instead of recovering from it.
A 2020–2025 systematic review shows lexical alignment drives task efficiency and comprehension, while emotional and prosodic alignment drive relational warmth and trust. Conflating them in design produces category errors—cold customer-service bots and evasive mental-health assistants.