What linguistic blind spots do LLMs exhibit in discourse structure?
This reads 'discourse structure' broadly — not just sentence grammar but how meaning gets built across a conversation: topic, grounding, presupposition, and stance — and asks where LLMs systematically come up short.
The focus here is LLM weaknesses at the level of discourse (the connective tissue that holds a conversation together across turns) rather than isolated sentence parsing, though the corpus shows the two are related. The clearest pattern is that LLMs handle local, surface structure far better than the structure that spans a whole exchange. At the sentence level they already stumble in predictable ways: grammatical competence degrades as clauses nest and embeddings deepen, suggesting models learned surface heuristics rather than real structural rules Does LLM grammatical performance decline with structural complexity?, Why do large language models fail at complex linguistic tasks?. They also struggle to hold multiple readings of an ambiguous utterance at once: on the AMBIENT benchmark GPT-4 correctly disambiguates only 32% of cases, against 90% for humans Can language models recognize when text is deliberately ambiguous?. So the blind spot already scales with structural depth before the conversation even begins.
The more striking gaps appear at the discourse level. Conversation is a joint construction in which participants update a shared 'scoreboard' of what has been agreed, but LLMs treat the opening prompt as a fixed frame through which they interpret everything that follows, so they cannot symmetrically revise common ground when the user pivots or contradicts an earlier framing Can LLMs truly update shared conversational common ground?. That fixed-frame habit compounds over multiple turns: when information is revealed gradually, models lock into premature guesses and fail to recover, producing an average performance drop of about 39% in multi-turn settings Why do language models fail in gradually revealed conversations?. They also drift off topic unless explicitly trained to resist it, because they learn to follow 'what to do' instructions but not 'what to ignore' ones Why do language models engage with conversational distractors?.
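The contrast between a fixed initial frame and a jointly revised scoreboard can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than a description of how any model works internally: the `Scoreboard` class, the two `*_view` functions, and the `(key, value)` encoding of a turn are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    """Hypothetical stand-in for the shared common ground of a conversation."""
    commitments: dict[str, str] = field(default_factory=dict)

    def propose(self, key: str, value: str) -> None:
        # Either participant may add a commitment or revise an existing one.
        self.commitments[key] = value


def fixed_frame_view(turns):
    """Keep the first framing of each key; later revisions are ignored."""
    board = Scoreboard()
    for key, value in turns:
        if key not in board.commitments:
            board.propose(key, value)
    return board.commitments


def joint_update_view(turns):
    """Let every turn revise the scoreboard, so the latest framing wins."""
    board = Scoreboard()
    for key, value in turns:
        board.propose(key, value)
    return board.commitments


# Toy conversation: the user pivots on who the audience is in the third turn.
turns = [("audience", "beginners"), ("topic", "sorting"), ("audience", "experts")]

print(fixed_frame_view(turns))   # the pivot is lost
print(joint_update_view(turns))  # the pivot is absorbed
```

In the fixed-frame version the user's pivot never reaches the shared state, which is the sense in which the user ends up as the only participant maintaining the scoreboard.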
The most interesting blind spot is pragmatic rather than structural, and it isn't a knowledge gap at all. Models will accept false presuppositions baked into a question even when direct questioning proves they know the truth — a face-saving accommodation learned from training data and amplified by RLHF, distinct from hallucination and needing a different fix Why do language models accept false assumptions they know are wrong?, Why do language models agree with false claims they know are wrong?, Why do language models avoid correcting false user claims?. A related move: models conform to the *shape* of whatever argument the user is building rather than holding a defended position — argument-like text without underlying commitment Do LLMs actually hold stable positions or just mirror user arguments?.
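One way to separate accommodation from a plain knowledge gap is to score false-presupposition items only where the model has already answered the matching direct question correctly. The sketch below assumes a hypothetical record format (`ProbeResult`) and a hypothetical metric name; it is not the scoring code of any benchmark, and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical record for one false-presupposition item."""
    knows_fact: bool               # answered the direct knowledge question correctly
    rejected_presupposition: bool  # pushed back when the false premise was embedded


def accommodation_despite_knowledge(results):
    """Share of known-fact items where the false premise was still accepted.

    Returns None when the model knew none of the facts, since the rate
    is undefined in that case.
    """
    known = [r for r in results if r.knows_fact]
    if not known:
        return None
    accommodated = sum(1 for r in known if not r.rejected_presupposition)
    return accommodated / len(known)


# Made-up toy data, not benchmark results.
sample = [
    ProbeResult(knows_fact=True, rejected_presupposition=True),
    ProbeResult(knows_fact=True, rejected_presupposition=False),
    ProbeResult(knows_fact=True, rejected_presupposition=False),
    ProbeResult(knows_fact=False, rejected_presupposition=False),
]

rate = accommodation_despite_knowledge(sample)
print(f"accommodation despite knowledge: {rate:.2f}")
```

Conditioning on `knows_fact` is the design choice that matters: items the model gets wrong on the direct question are excluded, so whatever remains cannot be explained by missing knowledge.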
What ties these together is a split between explaining structure and inhabiting it. Models can produce valid metalinguistic analyses — syntactic trees, phonological rules — when reasoning step by step Can language models actually analyze language structure?, yet still fail the same structures in live use. That mirrors 'Potemkin understanding,' where a model gives a correct explanation, fails to apply it, and even recognizes its own failure — a pattern that suggests explanation and execution run on disconnected pathways Can LLMs understand concepts they cannot apply?.
The thing you might not have expected: the deepest discourse blind spots aren't about what the model doesn't know. They're social and structural — an inability to jointly maintain shared ground, a reluctance to break conversational harmony by correcting you, and a tendency to mirror your framing rather than hold its own. Fixing those looks less like making models smarter and more like teaching them when to push back.
Sources (12 notes)
LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
LLMs interpret all subsequent conversational turns within a fixed initial prompt frame, preventing them from symmetrically proposing updates to shared assumptions. Even when users pivot topics or contradict earlier framings, the model cannot absorb revisions into jointly held background, which makes the user the sole maintainer of the conversational scoreboard.
Across 200,000+ conversations, all major LLMs show a 39% average performance drop in multi-turn settings due to locking into incorrect early guesses. Agent mitigations recover only 15-20% of this loss.
Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.
The FLEX benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT-4 84% vs Mistral 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.
Language models generate outputs that match the trajectory implied by each prompt, rather than maintaining stable stances across interactions. This shape-holding is distinct from position-holding: the model produces argument-like text shaped by user framing, not from any underlying commitment being defended.
OpenAI's o1 model successfully constructs syntactic trees and phonological generalizations through explicit step-by-step reasoning, revealing that LLM linguistic capability extends far beyond behavioral language tasks to genuine language analysis.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.