INQUIRING LINE

Why do language models fail at coreference across long contexts?

This line explores why models lose track of who or what a pronoun or name refers to as text gets longer. The corpus doesn't tackle coreference head-on, so the answer triangulates from work on linguistic structure, identity-tracking, relational queries, and length-driven decay.


This reads the question as: why do models lose the thread of *who* and *what* the words refer to once the context stretches out? No paper here studies coreference by name, but several circle the same territory from different angles, and together they suggest the failure isn't one bug but a stack of them. The most direct neighbor is the finding that LLMs make systematic grammatical errors that get predictably worse with structural complexity: they misidentify embedded clauses, verb phrases, and complex nominals because statistical learning captures surface patterns rather than deep grammatical rules, the kind that bind a pronoun to its antecedent ("Why do large language models fail at complex linguistic tasks?"). Coreference is exactly that kind of binding, so the same crack plausibly shows up wherever a sentence buries the referent under syntactic depth.

The more surprising contributor is identity itself. One line of work argues that a model never *commits* to a fixed character or entity: it holds a superposition of mutually consistent possibilities and samples one at generation time, so regenerating the same passage yields a different but locally coherent reading ("Do large language models actually commit to a single character?"). If there's no settled internal answer to "who is 'she'?", then coreference across a long span isn't being resolved and held; it's being re-improvised. That reframes the problem: the model isn't forgetting the antecedent; it never pinned it down in the first place.
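One way to make the distinction concrete is the toy sketch below. It is not a model of what happens inside an LLM; the candidate names, their weights, and both helper functions are hypothetical, chosen only to contrast re-sampling a referent on every pass with committing to one.

```python
import random

# Hypothetical antecedents for "she", both consistent with the preceding text.
# The equal weights are an assumption made purely for illustration.
CANDIDATES = {"Alice": 0.5, "Carol": 0.5}

def sampled_reading(rng: random.Random) -> str:
    """Re-improvise: draw a referent from the distribution each time."""
    names = list(CANDIDATES)
    weights = [CANDIDATES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]

def committed_reading() -> str:
    """Commit: pick one referent deterministically and keep it."""
    # Sorting first makes the tie-break stable (alphabetical).
    return max(sorted(CANDIDATES), key=CANDIDATES.get)

rng = random.Random(0)  # fixed seed so the run is repeatable
regenerations = [sampled_reading(rng) for _ in range(50)]
sampled_set = set(regenerations)
committed_set = {committed_reading() for _ in range(50)}

print("readings seen when sampling:  ", sorted(sampled_set))
print("readings seen when committing:", sorted(committed_set))
```

Each sampled reading is locally coherent, yet across regenerations "she" lands on different people; the committed version returns the same referent every time.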

Length then turns a fragile process into a failing one. Reasoning accuracy collapses with input length *far below* the context window: in the FLenQA results it drops from 92% to 68% at only 3,000 tokens of padding, in a way that doesn't track language-modeling quality and survives chain-of-thought prompting ("Does reasoning ability actually degrade with longer inputs?"). So "the context fit" is not the same as "the context was usable." A related result locates the real bottleneck not in memory capacity but in the *compute* needed to consolidate distant context into usable internal state: the model has the tokens but hasn't done the work to integrate them ("Is long-context bottleneck really about memory or compute?").
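To put the figures from the FLenQA note in perspective, the short calculation below works out the size of the drop (92% to 68% at 3,000 tokens of padding). The 128,000-token window is an assumed value for illustration only; the source does not name a specific window size.

```python
# Figures quoted in the FLenQA note in the sources.
baseline_accuracy_pct = 92
padded_accuracy_pct = 68
padding_tokens = 3_000

# ASSUMPTION: a 128k-token window, used only to give a sense of scale.
assumed_window_tokens = 128_000

absolute_drop_points = baseline_accuracy_pct - padded_accuracy_pct
relative_drop = absolute_drop_points / baseline_accuracy_pct
window_fraction_used = padding_tokens / assumed_window_tokens

print(f"absolute drop: {absolute_drop_points} percentage points")
print(f"relative drop: {relative_drop:.1%} of baseline accuracy")
print(f"window used:   {window_fraction_used:.1%} of the assumed window")
```

Under that assumption, roughly a quarter of baseline accuracy is lost while only a few percent of the window is occupied.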

There's also a structural ceiling worth knowing about: long-context models can do semantic retrieval over a big window, but they fail on queries that require *relational joins* across structured tables, the kind of operation that links this entity here to that mention there ("Can long-context LLMs replace retrieval-augmented generation systems?"). Coreference is, in effect, a join. And when a referent in the text conflicts with what the model learned in training, the parametric prior can simply override the context, so the model resolves the pronoun to its expectation rather than to the passage ("Why do language models ignore information in their context?").
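To see why coreference can be read as a join, the sketch below treats entity introductions and pronouns as two small tables and links them on agreement features plus position. Everything in it is hypothetical: the `Entity` and `Pronoun` records, the feature labels, and the most-recent-compatible-antecedent rule, which is a deliberately naive heuristic rather than how any real resolver or LLM works.

```python
from typing import NamedTuple, Optional

class Entity(NamedTuple):
    name: str
    position: int          # token offset where the entity is introduced
    features: frozenset    # hypothetical agreement features

class Pronoun(NamedTuple):
    surface: str
    position: int
    features: frozenset

def resolve(pronoun: Pronoun, entities: list) -> Optional[str]:
    """Join one pronoun to the nearest earlier entity whose features agree."""
    candidates = [
        e for e in entities
        if e.position < pronoun.position and pronoun.features <= e.features
    ]
    if not candidates:
        return None  # no compatible antecedent: the join yields no row
    return max(candidates, key=lambda e: e.position).name

entities = [
    Entity("Alice", 0, frozenset({"singular", "feminine"})),
    Entity("Bob", 5, frozenset({"singular", "masculine"})),
    Entity("the reports", 12, frozenset({"plural"})),
]
pronouns = [
    Pronoun("she", 20, frozenset({"singular", "feminine"})),
    Pronoun("he", 30, frozenset({"singular", "masculine"})),
    Pronoun("they", 40, frozenset({"plural"})),
    Pronoun("it", 50, frozenset({"singular", "neuter"})),
]

# The "join": one output row per pronoun, keyed by its surface form.
links = {p.surface: resolve(p, entities) for p in pronouns}
print(links)
```

Each link requires matching two records that sit in different places in the text, which is a different operation from retrieving a single passage that is semantically similar to a query.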

The thing you may not have expected: this looks less like a memory problem than the phrase "long context" implies. The corpus points to a model that doesn't firmly fix entities, can't reliably perform the relational joins coreference demands, degrades well before its window is full, and will overwrite the text with its priors when they're strong. On this reading, coreference fails at long range because all four are true at once. A similar fragility shows up in multi-turn conversations, where models drift from the user's actual intent as the exchange lengthens ("Why do language models lose performance in longer conversations?").


Sources (7 notes)

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Why do language models lose performance in longer conversations?

LLMs degrade in multi-turn settings because RLHF training rewards premature answers over clarification-seeking, creating pragmatic mismatch with individual user behaviors. A Mediator-Assistant architecture that explicitly parses user intent before execution recovers lost performance without retraining.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about why LLMs fail at coreference across long contexts. The question remains open: *what is the primary failure mode?* Is it memory, compute, entity commitment, relational reasoning, or prior override—or a conjunction?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A curated library identified five interlocking constraints:
• LLMs make systematic errors on embedded syntax and nested structures (2025-03), blocking the deep binding required for coreference.
• Models hold superpositions of entities rather than committed internal representations; regenerating the same passage yields different but locally-coherent readings (implied 2024–2025).
• Reasoning accuracy collapses far below context window limits—at only a few thousand tokens of padding—independent of language-modeling quality (2024-02).
• Long-context models fail on relational joins (linking entity A to mention B across distance) while succeeding at semantic retrieval (2024-06).
• Parametric priors override context: when training associations conflict with the passage, models resolve pronouns to expectation, not to text (2024-10, 2025-04).

Anchor papers (verify; mind their dates):
• arXiv:2402.14848 (2024-02): reasoning degradation far below window.
• arXiv:2406.13121 (2024-06): long-context retrieval vs. relational failure.
• arXiv:2503.19260 (2025-03): systematic linguistic blind spots.
• arXiv:2505.06120 (2025-05): multi-turn conversation drift as intent-alignment gap.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding, judge whether newer models (o1-style reasoning, extended-context training, architectural innovations like recursive computation [2025-12]), inference methods (speculative decoding, adaptive computation), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question—*can LLMs perform stable cross-sentence binding?*—from perishable limitations; cite what resolved them.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming coreference recovery, entity tracking improvements, or relational reasoning breakthroughs; flag any that undermine the "conjunction of failures" reading.
(3) Propose 2 research questions that ASSUME the regime has shifted: e.g., "Does recursive decoding (2025-12) enable committed entity state across turns?" or "Do adaptive compute-allocation methods resolve the length-degradation vs. window-size mismatch?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
