SYNTHESIS NOTE
Conversational AI and Personalization Psychology, Society, and Alignment Reasoning, Retrieval, and Evaluation

Why do language models fail in gradually revealed conversations?

Explores why LLMs perform 39% worse when instructions arrive incrementally rather than upfront, and whether they can recover from early mistakes in multi-turn dialogue.

Synthesis note · 2026-02-22 · sourced from Conversation Topics Dialog
Why do AI conversations reliably break down after multiple turns? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

Laban et al. (2025) conduct large-scale simulation experiments (200,000+ conversations) comparing LLM performance in single-turn fully-specified vs. multi-turn underspecified settings across six generation tasks. The finding is stark: all top open- and closed-weight LLMs exhibit significantly lower performance in multi-turn conversations, with an average drop of 39%.

The performance degradation decomposes into two components. The minor one is aptitude loss — models are slightly less capable when instructions arrive incrementally. The major one is unreliability increase — when models take a wrong turn, they get lost and do not recover. This is the "lost in conversation" phenomenon.

Four specific failure behaviors drive the degradation:

  1. Overly verbose responses — models generate too much too early
  2. Premature solution proposals — attempting final answers before sufficient information arrives
  3. Incorrect assumptions — filling in underspecified details with guesses
  4. Over-reliance on previous attempts — locking in to early (wrong) answers

The SHARDED simulation methodology is key: it transforms existing single-turn instructions into shards revealed one per turn, enforcing gradual disclosure. The CONCAT control confirms the effect is specifically about underspecification and multi-turn nature, not rephrasing. The drop appears even in two-turn conversations and across all LLMs from 8B to state-of-the-art.

Agent-like mitigations (RECAP: final-turn recapitulation; SNOWBALL: turn-level reminders) recover only 15-20% of the loss. The authors argue LLMs should natively support multi-turn interaction — relying on agent frameworks to preprocess is insufficient. Since Why can't conversational AI agents take the initiative?, this passivity compounds: models neither lead the conversation to gather missing information nor recover when their assumptions prove wrong.

The underspecification tested here is not adversarial — it reflects "the principle of least effort" (Zipf), a natural tendency in human conversation. Users routinely start vague and refine. The models' failure is thus a failure at normal conversation, not edge cases. Since Does preference optimization harm conversational understanding?, the premature assumptions are not random — they are incentivized by RLHF training that rewards confident single-turn answers over grounding acts like clarification. The alignment tax produces models that guess rather than ask, and the lost-in-conversation phenomenon is the multi-turn consequence. More specifically, since Why do language models sound fluent without grounding?, the 77.5% reduction in grounding acts means models skip the clarification and repair mechanisms that would prevent the lock-in to incorrect assumptions. And since Do language models actually build shared understanding in conversation?, the premature assumptions are a specific form of this: filling in underspecified details with guesses is precisely presuming common ground that does not yet exist.

The STORM framework reframes this from a model failure to a fundamental interaction design problem. Since How do users actually form intent when prompting AI systems?, underspecification is not laziness — it reflects that users genuinely cannot articulate their full intent upfront. The "gulf of envisioning" means users lack the vocabulary and conceptual framework to specify what they want, while the AI lacks the ability to help them develop it. This deepens the lost-in-conversation diagnosis: models don't just fail at underspecified inputs — they fail at the process through which intent matures from vague to specific.

MultiChallenge (2025) identifies four specific multi-turn challenge categories that all frontier models fail. Despite near-perfect scores on existing multi-turn benchmarks, all frontier models achieve less than 50% accuracy on MultiChallenge (Claude 3.5 Sonnet at 41.4%). The four categories: (1) instruction retention — following instructions from the first turn throughout the entire conversation; (2) inference memory of user information — recalling and connecting details scattered across previous turns; (3) reliable versioned editing — helping users revise materials through back-and-forth iterations; (4) self-coherence — maintaining consistency with model responses in conversation history and avoiding sycophancy. Each category requires simultaneous instruction-following, context allocation, and in-context reasoning, confirming that multi-turn failure is a compound capability gap, not a single missing skill. Source: Arxiv/Evaluations.

Inquiring lines that use this note as a source 108

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 15

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
26 direct connections · 216 in 2-hop network ·medium cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

llms get lost in multi-turn conversation because they make premature assumptions under underspecification and cannot recover