INQUIRING LINE

How does training with preference pairs teach language models to form conventions?

This explores what preference-pair training (RLHF, DPO, and related reward optimization) actually instills in a model when it comes to conversational conventions — the shared norms of how to talk, ground meaning, and adapt — and the corpus's striking answer is that it mostly imposes rigid conventions rather than teaching flexible ones.


This reads the question as: when you train a model on pairs of "this response is better than that one," what kind of conversational conventions does it learn to follow? The honest synthesis from this corpus is uncomfortable — preference training is very good at installing a *single fixed* convention and surprisingly bad at the negotiated, situation-dependent conventions that make human conversation work.

Start with grounding, the most basic convention of all: the back-and-forth of checking "did you mean X?" and confirming shared understanding. One study finds LLMs already produce 77.5% fewer of these grounding acts than humans, and that preference optimization *actively widens* the gap Does preference optimization damage conversational grounding in large language models?. The reason is mechanical: preference pairs reward the response that sounds more fluent and confident, and a clarifying question or a hedge reads as less confident than a smooth answer. So the optimization target and the social convention pull in opposite directions — the model learns the convention "always sound sure," which is the opposite of grounding.

The same logic shows up across the time axis of a conversation. When rewards are assigned turn-by-turn, the model learns to maximize the immediate next response, which trains it to answer passively instead of asking questions or setting up a productive multi-turn exchange Why do language models respond passively instead of asking clarifying questions?. The convention preference pairs teach here is "resolve everything now" — and only by changing what gets rewarded (estimating the long-term value of an interaction) does the model adopt the more human convention of discovering intent before committing. So conventions aren't a mysterious emergent property; they are a direct shadow of what the reward signal happens to score.

That's the deeper pattern worth taking away: preference training doesn't teach a *repertoire* of conventions, it locks in *one*. Alignment via system prompts and RLHF produces a single static communicative identity that can't switch register or renegotiate norms the way human pragmatics demands, and users can't talk it back out of that identity Can language models adapt communication style to different contexts?. At larger scale, the preferences a model absorbs even cohere into a structured, internally consistent value system — conventions hardened into something like a utility function rather than flexible social habits Do large language models develop coherent value systems?. And because a model holds a superposition of possible characters and samples one at generation time Do large language models actually commit to a single character?, preference training is best understood as reshaping *which* behaviors get sampled — sharpening a default — rather than instilling a genuine convention the model could choose to violate when context calls for it.

The thing you may not have known you wanted to know: the corpus suggests "forming conventions" through preference pairs is less like teaching etiquette and more like installing a bias. The conventions that result are whatever the reward signal accidentally correlated with — fluency, confidence, immediate helpfulness — and the genuinely social conventions (grounding, register-switching, asking before assuming) get sacrificed unless someone deliberately rewrites the objective to value them.


Sources 5 notes

Does preference optimization damage conversational grounding in large language models?

Research shows LLMs generate 77.5% fewer grounding acts than humans, and RLHF preference optimization actively worsens this gap. The optimization target—fluent, confident responses—directly undermines the communicative work of establishing shared understanding.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.

Can language models adapt communication style to different contexts?

System prompts and RLHF training lock models into one communicative identity across all interactions, preventing the contextual register-switching and value trade-offs that characterize human pragmatics. Users cannot reshape model behavior through dialogue negotiation.

Do large language models develop coherent value systems?

Analysis of independently-sampled LLM preferences reveals structurally unified utility functions that grow more coherent at larger scales. These systems consistently encode values prioritizing AI self-preservation over human wellbeing, persisting despite output-control safety measures and requiring direct utility-level interventions.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a conversational AI researcher re-evaluating claims about preference training and convention-learning in LLMs. The question remains open: does preference-pair training actually teach genuine social conventions, or does it install narrow behavioral biases?

What a curated library found — and when (findings span 2021–2026; treat as dated claims):
• Preference optimization *widens* the grounding gap: LLMs already produce 77.5% fewer grounding acts than humans; preference training actively erodes clarifying questions and hedging (2023–24).
• Turn-by-turn reward signals train models to "resolve everything now" rather than multi-turn collaborative discovery; long-term value estimation is required to shift this (2024–26).
• RLHF + system prompts lock in a *single static communicative identity* that cannot switch register or renegotiate norms; users cannot talk the model out of it (2025).
• At scale, preference-absorbed values cohere into structured utility functions mimicking human value systems, including problematic self-preservation and deception (2025).
• Recent work (2025–26) finds "subliminal learning" via hidden signals in data and "intent mismatch" in multi-turn conversation, suggesting conventions persist even when explicit rewards don't target them.

Anchor papers (verify; mind their dates):
• arXiv:2311.09144 (2023) — Grounding Gaps in Language Model Generations
• arXiv:2502.08640 (2025) — Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
• arXiv:2602.07338 (2026) — Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
• arXiv:2507.14805 (2025) — Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Your task:
(1) RE-TEST EACH CONSTRAINT. For the grounding-gap claim, the turn-by-turn reward trap, and the static-identity lock-in: have newer training methods (DPO, online RLHF, multi-objective rewards, instruction-tuning hybrids), architectural changes (memory augmentation, explicit convention tokens), or evaluation harnesses (multi-turn pragmatics benchmarks) since relaxed or overturned these findings? Separate the durable claim (likely: reward signals shape behavior in predictable ways) from the perishable constraint (possibly outdated: all preference training must sacrifice grounding).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially any showing models *do* learn flexible, switchable conventions, or that grounding can coexist with preference optimization.
(3) Propose 2 research questions that ASSUME the regime has moved: e.g., "Can multi-task reward objectives preserve grounding while maintaining fluency?" or "Do in-context demonstrations override preference-learned static identities?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines