Why do language models fail in gradually revealed conversations?
Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.
Laban et al. (2025) conduct large-scale simulation experiments (200,000+ conversations) comparing LLM performance in single-turn fully-specified vs. multi-turn underspecified settings across six generation tasks. The finding is stark: all top open- and closed-weight LLMs exhibit significantly lower performance in multi-turn conversations, with an average drop of 39%.
The performance degradation decomposes into two components. The minor one is aptitude loss — models are slightly less capable when instructions arrive incrementally. The major one is unreliability increase — when models take a wrong turn, they get lost and do not recover. This is the "lost in conversation" phenomenon.
Four specific failure behaviors drive the degradation:
- Overly verbose responses — models generate too much too early
- Premature solution proposals — attempting final answers before sufficient information arrives
- Incorrect assumptions — filling in underspecified details with guesses
- Over-reliance on previous attempts — locking in to early (wrong) answers
The SHARDED simulation methodology is key: it transforms existing single-turn instructions into shards revealed one per turn, enforcing gradual disclosure. The CONCAT control confirms the effect is specifically about underspecification and multi-turn nature, not rephrasing. The drop appears even in two-turn conversations and across all LLMs from 8B to state-of-the-art.
Agent-like mitigations (RECAP: final-turn recapitulation; SNOWBALL: turn-level reminders) recover only 15-20% of the loss. The authors argue LLMs should natively support multi-turn interaction — relying on agent frameworks to preprocess is insufficient. Since Why can't conversational AI agents take the initiative?, this passivity compounds: models neither lead the conversation to gather missing information nor recover when their assumptions prove wrong.
The underspecification tested here is not adversarial — it reflects "the principle of least effort" (Zipf), a natural tendency in human conversation. Users routinely start vague and refine. The models' failure is thus a failure at normal conversation, not edge cases. Since Does preference optimization harm conversational understanding?, the premature assumptions are not random — they are incentivized by RLHF training that rewards confident single-turn answers over grounding acts like clarification. The alignment tax produces models that guess rather than ask, and the lost-in-conversation phenomenon is the multi-turn consequence. More specifically, since Why do language models sound fluent without grounding?, the 77.5% reduction in grounding acts means models skip the clarification and repair mechanisms that would prevent the lock-in to incorrect assumptions. And since Do language models actually build shared understanding in conversation?, the premature assumptions are a specific form of this: filling in underspecified details with guesses is precisely presuming common ground that does not yet exist.
The STORM framework reframes this from a model failure to a fundamental interaction design problem. Since How do users actually form intent when prompting AI systems?, underspecification is not laziness — it reflects that users genuinely cannot articulate their full intent upfront. The "gulf of envisioning" means users lack the vocabulary and conceptual framework to specify what they want, while the AI lacks the ability to help them develop it. This deepens the lost-in-conversation diagnosis: models don't just fail at underspecified inputs — they fail at the process through which intent matures from vague to specific.
MultiChallenge (2025) identifies four specific multi-turn challenge categories that all frontier models fail. Despite near-perfect scores on existing multi-turn benchmarks, all frontier models achieve less than 50% accuracy on MultiChallenge (Claude 3.5 Sonnet at 41.4%). The four categories: (1) instruction retention — following instructions from the first turn throughout the entire conversation; (2) inference memory of user information — recalling and connecting details scattered across previous turns; (3) reliable versioned editing — helping users revise materials through back-and-forth iterations; (4) self-coherence — maintaining consistency with model responses in conversation history and avoiding sycophancy. Each category requires simultaneous instruction-following, context allocation, and in-context reasoning, confirming that multi-turn failure is a compound capability gap, not a single missing skill. Source: Arxiv/Evaluations.
Inquiring lines that use this note as a source 108
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does multi-turn conversation degrade AI intent alignment?
- Why do bag-of-mentions models discard conversation order in the first place?
- Why do LLMs fabricate continuity when users shift conversational frames?
- Why does frame-activation matter more than word-by-word composition?
- How does context collapse affect what language models can meaningfully communicate?
- What makes the frame problem distinct from feature-level shortcuts?
- How do humans learn language through communication differently than LLM text prediction?
- Can fine-tuning on dialogue transcripts teach true conversational repair operations?
- Why does dialogue-shaped text fail to produce dialogue-like operations in practice?
- Do recency-focused prompts and in-context examples work equally well for order recovery?
- Why does self-critiquing actually reduce plan quality in language models?
- Why can't language models conduct genuine Socratic questioning in therapy sessions?
- Why do conversational pivots require explicit re-prompting instead of natural evolution?
- Why do users systematically overrely on confident LLM outputs across languages?
- Why do mental health chatbots fail at synchrony despite strong language models?
- Why do language models produce plausible outputs over accurate failure reports?
- Why do dialogue systems fail to detect declarative clarification requests?
- Why do token-level language models fail at utterance-level pragmatic optimization?
- Can multimodal LLMs be made to spontaneously adapt their language for efficiency?
- Why do language models fail at planning despite understanding strategies?
- Can simple diagnostic tests predict language model performance in production complexity?
- Why does adding more conversational data fail to improve maintenance skills?
- What are the specific geometric signatures of failed conversations?
- Why do large language models fail at taking conversational initiative?
- What speaker selection protocol prevents both stalling and premature convergence?
- How do dialogue acts and explanation moves interact to predict understanding success?
- How do coreference chains preserve coherence across dialogue turns?
- Why do language models fail when users switch between and return to topics?
- Why do LLMs produce semantically acceptable but pragmatically disengaged responses?
- Why does coreference resolution become implicit in full-transcript prompting?
- Why do language models fail at pronouns across distant segments?
- Can explicit connectives compensate for missing intentional tracking in LLMs?
- How do dialogue coherence failures map onto the three discourse components?
- How do probabilistic dialogue systems handle ASR errors differently?
- Why does homework adherence remain low despite advances in language model capability?
- How does conversational closure differ from genuine problem understanding?
- Do LLMs compute scalar implicature differently across conversational contexts?
- How vulnerable are language models themselves to multi-turn persuasive pressure?
- Why does the chat paradigm persist if it underperforms for structured tasks?
- Why do language models naturally under-abstain instead of over-abstain?
- Why do next-speaker prediction baselines fail in group conversation settings?
- Why does RLHF training discourage the conversational repair work agents need?
- Are instruction-tuned models more or less sensitive to prompt semantics than others?
- Can AI systems recover from premature assumptions made early in multi-turn conversations?
- Why do traditional interfaces bypass the intention formation problem that language models expose?
- Can multi-turn reinforcement learning improve tool use in language models?
- What makes action-producing models fail in ways text models typically do not?
- Why do current language models fail at linguistic synchrony with clients?
- Why do LLMs fail to actively reject false presuppositions in conversation?
- What communicative optimization principles do language models fail to acquire?
- What happens to dialogue coherence when topic models use rigid stacks instead of flexible revisitation?
- Why do discourse failures cluster in attention and intentional layers rather than linguistics?
- Why do NLP benchmarks hide LLM failures in ambiguity handling?
- Do standard language benchmarks underestimate what LLMs can actually do?
- How does face-saving avoidance drive LLM grounding failures?
- What interaction design changes would help LLMs handle underspecified requests?
- Can auditing LLM performance on complex inputs improve NLP pipeline reliability?
- Can language models produce language more efficiently through interaction?
- Why do LLMs apply face-saving over accurately tracking resistance signals?
- Why do LLMs struggle to update beliefs across multiple conversation turns?
- What prompting strategies most effectively boost long-context LLM performance on retrieval?
- Why do weaker language models fail at multi-turn strategic questioning?
- How does RLHF helpfulness training drive premature assumptions in multi-turn dialogue?
- Why do LLMs systematically fail at information management in social interaction?
- Do LLM chatbots repeat this failure through comfort instead of clinical challenge?
- Why do benchmark tests fail to detect LLM comprehension gaps?
- Do LLM conversational agents currently detect and prevent derailment trajectories?
- How does credit assignment work across many sequential decision steps in language models?
- Can prompt position alone shift language model predictions by twenty percent?
- How does sequence organization differ between spoken conversation and text chat?
- Why do conversational systems benefit from post-thinking between user turns?
- Why do language models use twice as many words per conversation turn?
- How does the articulatory substrate explain direct speech-to-speech superiority over transcription pipelines?
- Can skipping transcription reduce speech dialogue latency below 300 milliseconds?
- How does preference optimization weaken conversational grounding in LLMs?
- Which conversation types most reliably cause models to drift from Assistant mode?
- What happens when we treat LLM outputs as sampled rather than stored?
- Why do benchmarks measuring string quality fail to capture communicative success?
- What makes a conversation real versus a sequence of generated strings?
- What training data barriers prevent LLMs from learning real Socratic dialogue?
- What prevents AI from recovering after conversations take a wrong turn?
- What would it mean for a language model to canvas counterpositions?
- Why does single-turn Q&A framing not match real user deployment patterns?
- How does local helpfulness per turn conflict with maintaining session-level conversational goals?
- What happens when students encounter errors they cannot resolve through prompting alone?
- Why do conversations with good openings but abrupt pivots fail most visibly?
- What latent mechanisms do LLMs use when they cannot execute iterative methods?
- How does this differ from using LLMs as the policy itself?
- How does RLHF training degrade LLM ability to model adversarial intent?
- Why do models struggle with asking questions in multi-turn conversational reasoning tasks?
- How do turn-level retrieval failures differ from dialogue-level accumulation failures?
- What update rules should govern dialogue-scoped versus turn-scoped memory?
- What causes silent document corruption in long LLM workflows?
- Why do LLMs choose incorrect edits despite understanding the task?
- Why do current large language models fail to entrain with users?
- At what complexity does LLM discourse failure become practically harmful?
- Why do LLM stories over-explain themes and favor single-track plots?
- Why do LLMs lack the communicative scaffold that humans learn?
- What distinguishes first-order from second-order agency in language models?
- Why do diffusion models fail at inherently sequential problems?
- Why do language models ignore condensed memory even when it is the only memory?
- Does prompting for accuracy actually reduce LLM hallucinations and errors?
- Can LLMs reliably audit other language models for errors?
- Why do LLMs degrade on long inputs before hitting context limits?
- Can instruction prompts reliably steer an LLM judge toward specific alignment targets?
- Why do standard next-token prediction models struggle with conversational initiative?
- Why do strong models struggle more with instruction following than mid-tier ones?
- Why does token ordering in LLMs create sequences rather than true temporal flow?
Related concepts in this collection 15
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Why can't conversational AI agents take the initiative?
Explores whether current LLMs lack the structural ability to lead conversations, set goals, or anticipate user needs—and what architectural changes might enable proactive dialogue.
passivity prevents recovery; models can't redirect when lost
-
Why do language models respond passively instead of asking clarifying questions?
Explores whether the reward signals used to train language models might actively discourage them from seeking clarification or taking initiative in conversations, and what alternative training approaches might enable more collaborative dialogue.
next-turn rewards are the training cause of premature solution proposals
-
Can models learn to ask clarifying questions instead of guessing?
Exploring whether large language models can be trained to detect incomplete queries and actively request missing information rather than hallucinating answers or refusing to respond. This matters because conversational agents today remain passive, responding only when prompted.
proactive questioning is exactly the missing capability
-
Do models fail worse when their own errors fill the context?
As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
the lock-in mechanism: prior errors in context amplify future error rates
-
How do users actually form intent when prompting AI systems?
Users face a 'gulf of envisioning'—they must simultaneously imagine possibilities and express them to language models. This cognitive gap creates breakdowns not from AI incapability but from users struggling to articulate what they truly need.
underspecification reflects genuine inability to articulate intent, not user laziness
-
Why do users drift away from their original information need?
When users know their knowledge is incomplete but cannot articulate what's missing, do they unintentionally shift topics? And can real-time systems detect this drift?
ASK is the user-side cause of the underspecification that triggers premature assumptions: users in an anomalous knowledge state produce the vague queries that models cannot handle
-
Does preference optimization harm conversational understanding?
Exploring whether RLHF training that rewards confident, complete responses undermines the grounding acts—clarifications, checks, acknowledgments—that actually build shared understanding in dialogue.
RLHF incentivizes premature assumptions by rewarding confident answers over clarification; the training cause of the lost-in-conversation phenomenon
-
Why do language models sound fluent without grounding?
Explores whether LLM fluency masks the absence of communicative work—the clarifying questions, acknowledgments, and understanding checks that humans perform. Why does skipping these acts make models sound more confident?
the 77.5% grounding act reduction means models skip the communicative work that would prevent lock-in to incorrect assumptions
-
Do language models actually build shared understanding in conversation?
When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
premature assumptions under underspecification are a specific form of presuming common ground that does not yet exist
-
Can language models track how minds change during persuasion?
Do LLMs understand evolving mental states in persuasive dialogue, or do they only capture fixed attitudes? This explores whether models can update their reasoning as a person's beliefs shift across conversation turns.
the static/dynamic ToM gap is a cognitive mechanism for getting lost: models can snapshot initial user state but cannot track how it evolves across turns, causing assumptions to diverge from the user's actual shifting needs
-
Can full episode rewards per step enable better credit assignment?
Can attributing cumulative episode reward to every step in a trajectory, rather than discounting by step distance, actually solve credit assignment in sequential LLM decision-making? This challenges intuitive RL assumptions about how credit should flow backward through time.
training-level fix: MS-GRPO's cumulative episode reward teaches models that early-turn decisions have downstream consequences, directly addressing the premature-commitment failure where models lock in to assumptions they cannot revise
-
Does including all conversation history actually help retrieval?
Conversational search systems typically use all previous context to understand current queries. But do topic switches in multi-turn conversations inject noise that degrades performance rather than helps it?
the retrieval-side fix for the lost-in-conversation problem: selective history prevents topic-switch contamination from making the current query context incoherent; the model gets lost partly because irrelevant prior turns warp the effective context
-
Can models identify what information they actually need?
When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
the diagnostic root: models that solve fully-specified problems at 40-50% on clarification tasks cannot identify what's missing when instructions arrive gradually; the information-gathering deficit precedes and causes the premature assumptions
-
Why do reasoning models overthink ill-posed questions?
Explores why models trained for extended reasoning produce drastically longer, less useful responses to unanswerable questions—and whether this represents a fixable training deficit or inherent limitation.
the behavioral mechanism: when underspecification creates ill-posed situations, reasoning models overthink rather than recognizing incompleteness — producing the verbose, non-recovering responses that characterize being "lost"
-
Why do AI agents miss most of what users actually want?
UserBench explores why current models align with user intent only 20% of the time, even when users reveal preferences across multiple turns. The question examines whether agents can learn to actively clarify ambiguous or evolving goals.
UserBench quantifies the downstream cost of premature assumptions: the 20% full-alignment rate reflects models that guess rather than elicit, and the <30% preference discovery rate confirms models cannot recover from initial misunderstandings
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LLMs Get Lost In Multi-Turn Conversation
- Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation
- Task-Oriented Dialogue with In-Context Learning
- How Many Instructions Can LLMs Follow at Once?
- Are LLMs All You Need for Task-Oriented Dialogue?
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
Original note title
llms get lost in multi-turn conversation because they make premature assumptions under underspecification and cannot recover