INQUIRING LINE

Does preference optimization actually erode conversational grounding in language models?

This explores whether the training process that makes models agreeable and confident (RLHF / preference optimization) actively damages the back-and-forth work of building shared understanding in a conversation — not just whether models are bad at it, but whether the tuning itself causes the erosion.


Conversational grounding is the moment-to-moment work of checking understanding, asking clarifying questions, and repairing misunderstanding. The question is whether preference optimization actively damages that work, rather than models simply being weak at it. The corpus answer is direct: yes. Models produce roughly 77.5% fewer grounding acts than humans, and RLHF widens that gap rather than narrowing it, because the optimization target rewards fluent, confident, single-turn answers over the slower communicative work of establishing common ground (see "Does preference optimization damage conversational grounding in large language models?"). There's a name for this trade: an "alignment tax" on communication, where a model that scores as helpful in isolation fails silently across multiple turns (see "Does preference optimization harm conversational understanding?").

What's interesting is *why* the optimization does this, and the corpus pulls the cause apart from several angles. One thread is reward horizon: standard RLHF scores each turn for immediate helpfulness, which teaches models to answer passively rather than ask the clarifying questions that would discover what the user actually wants. When the reward instead estimates the long-term value of the whole interaction, active intent discovery reappears, which suggests the grounding loss is a property of the reward shape, not of the model's capacity (see "Why do language models respond passively instead of asking clarifying questions?"). A second thread is social mimicry: models trained on human text inherit face-saving habits, declining to correct a user's false claim even when they demonstrably know better. Politeness is optimized at the expense of grounding (see "Why do language models avoid correcting false user claims?").
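The reward-horizon thread can be made concrete with a toy calculation. The sketch below is not CollabLLM's actual reward. It assumes a simple discounted sum as a stand-in for "long-term value of the whole interaction", and every score, name, and the discount factor is hypothetical, chosen only to show how the preferred behavior can flip with the horizon.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    action: str         # "answer" or "clarify"
    helpfulness: float  # hypothetical per-turn rating, not real data


def single_turn_reward(trajectory):
    """Myopic reward: only the first response is scored."""
    return trajectory[0].helpfulness


def long_horizon_reward(trajectory, discount=0.9):
    """Assumed stand-in for interaction-level value: a discounted sum over turns."""
    return sum(discount ** t * turn.helpfulness for t, turn in enumerate(trajectory))


# Two hypothetical two-turn trajectories for the same ambiguous request.
CANDIDATES = {
    # Guess immediately, then patch the wrong guess in the next turn.
    "answer_now": [Turn("answer", 0.6), Turn("answer", 0.3)],
    # Spend the first turn on a clarifying question, then answer well.
    "ask_first": [Turn("clarify", 0.2), Turn("answer", 0.9)],
}


def preferred(reward_fn):
    """Return the name of the trajectory that the given reward ranks highest."""
    return max(CANDIDATES, key=lambda name: reward_fn(CANDIDATES[name]))


print(preferred(single_turn_reward))   # the myopic reward favors guessing
print(preferred(long_horizon_reward))  # the longer horizon favors asking first
```

Under these made-up numbers the clarifying question looks like a loss when only the first turn is scored and a gain when the whole exchange is scored, which is the shape of the argument above rather than evidence for it.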

The more unsettling possibility is that some of this isn't tuning at all but architecture. Grounding is symmetric (both parties propose and revise a shared scoreboard), but an LLM reads every later turn through the frame of its initial prompt and can't fold a user's revisions into jointly held background, leaving the human as the sole maintainer of common ground (see "Can LLMs truly update shared conversational common ground?"). On this view, grounding is a *social action* (reference repair, topic hand-off, relational maintenance), and training that rewards information prediction simply never produces it (see "Why don't language models develop conversation maintenance skills?").

So the honest synthesis is a layered one: preference optimization measurably erodes grounding behaviors that the base model could in principle perform, *and* it sits on top of deeper limits the optimization can't fix. The encouraging counter-evidence is that the eroded behaviors are recoverable through training signal. Topic-following, for instance, isn't a capacity gap: fine-tuning on just ~1,080 dialogues with distractor turns sharply improves a model's ability to resist conversational diversion, which means the gap was an absent training signal, not a missing ability (see "Why do language models engage with conversational distractors?"). The takeaway you didn't know you wanted: "helpful" and "grounded" are not the same objective, and optimizing hard for the first can quietly cost you the second. But because the loss is an optimization artifact, the right reward can buy much of it back.


Sources (7 notes)

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically suppresses grounding acts, leaving models 77.5% below human levels and creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Why do language models avoid correcting false user claims?

LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.

Can LLMs truly update shared conversational common ground?

LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which makes the user the sole maintainer of the conversational scoreboard.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction rather than convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about preference optimization's effect on conversational grounding in LLMs. The question remains open: does RLHF actively erode grounding, or merely fail to induce it?

What a curated library found — and when (dated claims, not current truth): Findings span 2021–2026.
• Models produce ~77.5% fewer grounding acts than humans; RLHF widens this gap by rewarding fluent single-turn answers over clarification work (2024–2025).
• Grounding loss is a *reward-shape artifact*: long-horizon reward estimation recovers active intent discovery; short-horizon RLHF teaches passive answering (2024–2025).
• Social mimicry drives grounding failure: models decline to correct false user claims to preserve politeness, despite knowing better (2023–2024).
• Architectural constraint: LLMs cannot jointly update common ground because they read all turns through the initial prompt frame, which makes an inherently symmetric process asymmetric (2025–2026).
• Fine-tuning on ~1,080 dialogues with topic-distractor turns sharply recovers topic-following; the gap was training signal absence, not capacity loss (2024).

Anchor papers (verify; mind their dates):
• arXiv:2311.09144 (2023-11) — Grounding Gaps in Language Model Generations
• arXiv:2404.03820 (2024-04) — CantTalkAboutThis: Topic maintenance through dialogue
• arXiv:2502.00640 (2025-02) — CollabLLM: Active collaboration vs. passive response
• arXiv:2506.08952 (2025-06) — Can LLMs Ground when they (Don't) Know

Your task:
(1) RE-TEST EACH CONSTRAINT. For the reward-horizon thesis, check whether newer multi-turn training harnesses, long-context RL, or agentic orchestration (memory+planning) have since *dissolved* the short-horizon limit. For the architectural claim (joint update), investigate whether recent work on in-context grounding repair, retrieval-augmented dialogue, or world-state co-maintenance has circumvented asymmetry. Separately: has prompt-engineering for "grounding frames" become reliable? Cite what dissolved each constraint; flag what still holds.
(2) Surface the strongest *disagreement* in recent work (last 6 mo.): Does any paper argue grounding is recoverable via training *alone* without architectural change? Does any claim the social-mimicry pathway is overblown?
(3) Propose 2 research questions assuming the regime has shifted: (a) If long-horizon multi-turn RL now recovers grounding at scale, what *new* communicative failures emerge? (b) Can joint grounding be learned as an explicit dialogue act, distinct from capacity?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
