SYNTHESIS NOTE

Topics›Conversation Topics Dialog›this note

Why do language models fail in gradually revealed conversations?

Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.

Synthesis note · 2026-02-22 · sourced from Conversation Topics Dialog

Laban et al. (2025) conduct large-scale simulation experiments (200,000+ conversations) comparing LLM performance in single-turn fully-specified vs. multi-turn underspecified settings across six generation tasks. The finding is stark: all top open- and closed-weight LLMs exhibit significantly lower performance in multi-turn conversations, with an average drop of 39%.

The performance degradation decomposes into two components. The minor one is aptitude loss — models are slightly less capable when instructions arrive incrementally. The major one is unreliability increase — when models take a wrong turn, they get lost and do not recover. This is the "lost in conversation" phenomenon.

Four specific failure behaviors drive the degradation:

Overly verbose responses — models generate too much too early
Premature solution proposals — attempting final answers before sufficient information arrives
Incorrect assumptions — filling in underspecified details with guesses
Over-reliance on previous attempts — locking in to early (wrong) answers

The SHARDED simulation methodology is key: it transforms existing single-turn instructions into shards revealed one per turn, enforcing gradual disclosure. The CONCAT control confirms the effect is specifically about underspecification and multi-turn nature, not rephrasing. The drop appears even in two-turn conversations and across all LLMs from 8B to state-of-the-art.

Agent-like mitigations (RECAP: final-turn recapitulation; SNOWBALL: turn-level reminders) recover only 15-20% of the loss. The authors argue LLMs should natively support multi-turn interaction — relying on agent frameworks to preprocess is insufficient. Since Why can't conversational AI agents take the initiative?, this passivity compounds: models neither lead the conversation to gather missing information nor recover when their assumptions prove wrong.

The underspecification tested here is not adversarial — it reflects "the principle of least effort" (Zipf), a natural tendency in human conversation. Users routinely start vague and refine. The models' failure is thus a failure at normal conversation, not edge cases. Since Does preference optimization harm conversational understanding?, the premature assumptions are not random — they are incentivized by RLHF training that rewards confident single-turn answers over grounding acts like clarification. The alignment tax produces models that guess rather than ask, and the lost-in-conversation phenomenon is the multi-turn consequence. More specifically, since Why do language models sound fluent without grounding?, the 77.5% reduction in grounding acts means models skip the clarification and repair mechanisms that would prevent the lock-in to incorrect assumptions. And since Do language models actually build shared understanding in conversation?, the premature assumptions are a specific form of this: filling in underspecified details with guesses is precisely presuming common ground that does not yet exist.

The STORM framework reframes this from a model failure to a fundamental interaction design problem. Since How do users actually form intent when prompting AI systems?, underspecification is not laziness — it reflects that users genuinely cannot articulate their full intent upfront. The "gulf of envisioning" means users lack the vocabulary and conceptual framework to specify what they want, while the AI lacks the ability to help them develop it. This deepens the lost-in-conversation diagnosis: models don't just fail at underspecified inputs — they fail at the process through which intent matures from vague to specific.

MultiChallenge (2025) identifies four specific multi-turn challenge categories that all frontier models fail. Despite near-perfect scores on existing multi-turn benchmarks, all frontier models achieve less than 50% accuracy on MultiChallenge (Claude 3.5 Sonnet at 41.4%). The four categories: (1) instruction retention — following instructions from the first turn throughout the entire conversation; (2) inference memory of user information — recalling and connecting details scattered across previous turns; (3) reliable versioned editing — helping users revise materials through back-and-forth iterations; (4) self-coherence — maintaining consistency with model responses in conversation history and avoiding sycophancy. Each category requires simultaneous instruction-following, context allocation, and in-context reasoning, confirming that multi-turn failure is a compound capability gap, not a single missing skill. Source: Arxiv/Evaluations.

Inquiring lines that read this note 115

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Why do multi-turn conversations degrade AI intent and coherence?

How should dialogue recommender systems manage conversation history and state?

Is embodied interaction necessary for language meaning and genuine agency?

Why does frame-activation matter more than word-by-word composition?

Do language models understand semantics or rely on pattern matching?

How does context collapse affect what language models can meaningfully communicate?

How can AI systems learn from failures without cascading errors?

How do language models establish social grounding in human dialogue?

How can language models sustain linguistic synchrony and intersubjectivity during dialogue?

How do formal dialogue structures reveal conversation coherence mechanisms?

How do prompt structure and constraints affect model instruction reliability?

Why does self-revision increase model confidence while degrading accuracy?

Why does self-critiquing actually reduce plan quality in language models?

Why do LLM chatbots fail as independent therapeutic agents?

How do language models inherit human biases from training data?

Why do users systematically overrely on confident LLM outputs across languages?

Why do language models reinforce false assumptions instead of correcting them?

How should conversational agents balance goal-driven initiative with user control?

Can next-token prediction alone produce genuine language understanding?

What articulatory information do speech signals carry that text cannot?

What critical LLM failures do standard benchmarks hide?

Why do benchmark improvements fail to reflect actual reasoning quality?

What makes dialogue-based explanation more successful than monologue?

How do dialogue acts and explanation moves interact to predict understanding success?

Why do language models struggle with implicit discourse relations?

Do language models learn genuine linguistic structure or just surface patterns?

How should dialogue systems represent uncertainty from noisy speech input?

How do probabilistic dialogue systems handle ASR errors differently?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Can prompting inject entirely new knowledge into language models?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Can multi-turn reinforcement learning improve tool use in language models?

Why do agents confidently report success despite actually failing tasks?

What makes action-producing models fail in ways text models typically do not?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How does rhetorical adaptation affect LLM persuasion and detectability?

Why do LLMs apply face-saving over accurately tracking resistance signals?

Can prompting strategies overcome LLM biases without model fine-tuning?

What properties determine whether reward signals teach genuine reasoning?

How does credit assignment work across many sequential decision steps in language models?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

How can LLM user simulators model realistic goal-driven conversation?

Why does single-turn Q&A framing not match real user deployment patterns?

How should we design LLM systems to maintain alignment and control?

How does this differ from using LLMs as the policy itself?

How can models identify insufficient information and respond appropriately without guessing?

Why do models struggle with asking questions in multi-turn conversational reasoning tasks?

What causes silent corruption to amplify through delegated workflows?

What causes silent document corruption in long LLM workflows?

What structural advantages do diffusion language models offer over autoregressive methods?

What memory architectures best support persistent reasoning across extended interactions?

Why do language models ignore condensed memory even when it is the only memory?

What capability tradeoffs emerge when scaling model reasoning abilities?

How do training priors constrain what context information can override?

Why is in-context learning brittle to the order of examples presented?

Related concepts in this collection 15

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

26 direct connections · 220 in 2-hop network ·medium cluster Open in graph ↗

Why do language models fail in gradually reveale… Why can't conversational AI agents take the initia… Why do language models respond passively instead o… Can models learn to ask clarifying questions inste… Do models fail worse when their own errors fill th… How do users actually form intent when prompting A… Why do users drift away from their original inform… Does preference optimization harm conversational u… Why do language models sound fluent without ground…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Why can't conversational AI agents take the initiative? Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
passivity prevents recovery; models can't redirect when lost
Why do language models respond passively instead of asking clarifying questions? Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
next-turn rewards are the training cause of premature solution proposals
Can models learn to ask clarifying questions instead of guessing? Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
proactive questioning is exactly the missing capability
Do models fail worse when their own errors fill the context? As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
the lock-in mechanism: prior errors in context amplify future error rates
How do users actually form intent when prompting AI systems? Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
underspecification reflects genuine inability to articulate intent, not user laziness
Why do users drift away from their original information need? When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
ASK is the user-side cause of the underspecification that triggers premature assumptions: users in an anomalous knowledge state produce the vague queries that models cannot handle
Does preference optimization harm conversational understanding? Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
RLHF incentivizes premature assumptions by rewarding confident answers over clarification; the training cause of the lost-in-conversation phenomenon
Why do language models sound fluent without grounding? Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?
the 77.5% grounding act reduction means models skip the communicative work that would prevent lock-in to incorrect assumptions
Do language models actually build shared understanding in conversation? When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
premature assumptions under underspecification are a specific form of presuming common ground that does not yet exist
Can language models track how minds change during persuasion? Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.
the static/dynamic ToM gap is a cognitive mechanism for getting lost: models can snapshot initial user state but cannot track how it evolves across turns, causing assumptions to diverge from the user's actual shifting needs
Can full episode rewards per step enable better credit assignment? Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
training-level fix: MS-GRPO's cumulative episode reward teaches models that early-turn decisions have downstream consequences, directly addressing the premature-commitment failure where models lock in to assumptions they cannot revise
Does including all conversation history actually help retrieval? Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?
the retrieval-side fix for the lost-in-conversation problem: selective history prevents topic-switch contamination from making the current query context incoherent; the model gets lost partly because irrelevant prior turns warp the effective context
Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
the diagnostic root: models that solve fully-specified problems at 40-50% on clarification tasks cannot identify what's missing when instructions arrive gradually; the information-gathering deficit precedes and causes the premature assumptions
Why do reasoning models overthink ill-posed questions? Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
the behavioral mechanism: when underspecification creates ill-posed situations, reasoning models overthink rather than recognizing incompleteness — producing the verbose, non-recovering responses that characterize being "lost"
Why do AI agents miss most of what users actually want? UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
UserBench quantifies the downstream cost of premature assumptions: the 20% full-alignment rate reflects models that guess rather than elicit, and the <30% preference discovery rate confirms models cannot recover from initial misunderstandings

Why do language models fail in gradually revealed conversations?

Inquiring lines that read this note 115

Related concepts in this collection 15

Related papers in this collection 8

Search by related questions 5