INQUIRING LINE


AI models fail on old legal cases and long documents for the same reason: both trace back to thin training data.

How does era sensitivity in legal cases compound with context length failures?

This explores whether two separate weaknesses share a common root and stack on top of each other when you feed a model a long brief about a historical case: LLMs doing worse on older legal precedent, and reasoning breaking down as inputs get longer. The corpus suggests they aren't independent failures at all. Both trace back to the same thing, which is how unevenly the training data is distributed.

Start with era sensitivity. On a Supreme Court overruling benchmark, models reliably do worse on historical cases than modern ones. The diagnosed cause isn't that the law is harder; it's that recent cases flood the training corpus while older precedent is thinly represented, leaving the model with shallower internal representations of it (see "Why do language models struggle with historical legal cases?"). That's not a quirk of legal text. The same fingerprint shows up in temporal reasoning generally: models stay competent on short, structured time questions but start generating impossible timelines in long open-ended contexts, and that breakdown 'tracks training data distribution' as the model falls back on frequency heuristics instead of actually reasoning (see "Why do language models fail at temporal reasoning in complex tasks?"). Both failures are the model leaning on what was common in training when the going gets hard.

Now the context-length half, which is the part most people underestimate. Reasoning accuracy doesn't just degrade near the context window limit; on FLenQA it drops from 92% to 68% with only 3,000 tokens of padding, far below capacity, and chain-of-thought prompting doesn't rescue it (see "Does reasoning ability actually degrade with longer inputs?"). A complementary view reframes the bottleneck as not memory but the compute needed to consolidate everything in the window into usable internal state (see "Is long-context bottleneck really about memory or compute?"). So a long historical case file is a double tax: the model already has a thin grip on the era, and the length itself is eroding whatever reasoning it could muster.
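The padding result can be probed with a small harness. What follows is a minimal sketch, not FLenQA's actual construction: it assumes whitespace-split words are an acceptable proxy for tokens, and the names `pad_prompt`, `accuracy_by_length`, and `answer_fn` are hypothetical. The idea is to hold the task-relevant facts fixed, bury them in irrelevant filler of increasing length, and score the same question at each length.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# One item = (task-relevant facts, question, gold answer).
Item = Tuple[Sequence[str], str, str]


def pad_prompt(facts: Sequence[str], filler: Sequence[str],
               target_tokens: int, seed: int = 0) -> str:
    """Scatter irrelevant filler sentences around the facts until the text
    reaches roughly target_tokens whitespace-separated tokens (an assumed
    proxy for real tokenizer counts)."""
    usable = [f for f in filler if f.split()]  # drop empty filler sentences
    rng = random.Random(seed)
    sentences: List[str] = list(facts)
    n_tokens = sum(len(s.split()) for s in sentences)
    while n_tokens < target_tokens:
        if not usable:
            raise ValueError("padding requested but no usable filler given")
        pad = rng.choice(usable)
        # Inserting filler never reorders the facts relative to each other.
        sentences.insert(rng.randint(0, len(sentences)), pad)
        n_tokens += len(pad.split())
    return " ".join(sentences)


def accuracy_by_length(items: Sequence[Item], filler: Sequence[str],
                       lengths: Sequence[int],
                       answer_fn: Callable[[str], str]) -> Dict[int, float]:
    """Exact-match accuracy of answer_fn at each padding length.
    answer_fn is a hypothetical stand-in for a call to the model under test."""
    results: Dict[int, float] = {}
    for n in lengths:
        correct = 0
        for i, (facts, question, gold) in enumerate(items):
            prompt = pad_prompt(facts, filler, n, seed=i)
            prompt += "\n\nQuestion: " + question
            if answer_fn(prompt).strip().lower() == gold.strip().lower():
                correct += 1
        results[n] = correct / len(items)
    return results
```

To test the compounding question directly, the same harness could be run twice, once with facts drawn from modern cases and once with facts drawn from historical ones, and the two accuracy curves compared across lengths.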

The compounding mechanism becomes clearer through a third lens: failures are driven by instance-level unfamiliarity, not raw complexity. Models succeed on a reasoning chain of any length when they've seen similar instances and fail at novelty boundaries, because they fit instance patterns rather than general algorithms (see "Do language models fail at reasoning due to complexity or novelty?"). A historical case is precisely a low-familiarity instance, and a long document pushes more of that unfamiliar material through exactly the conditions where reasoning is most fragile. The two weaknesses don't simply add: the era gap makes the content novel, and length is the multiplier on novel content.

The less obvious finding: the corpus points to a defense that sidesteps both. A multilingual RAG system built for noisy, drifting historical newspapers wins not by reasoning harder but by aggressively expanding retrieval while forcing the model to refuse any answer it can't ground in evidence, trading coverage for integrity exactly where source quality is degraded (see "Can RAG systems refuse to answer without reliable evidence?"). The catch is that long context alone won't substitute for this: long-context models match retrieval on semantic tasks but fail on structured, relational queries, so stuffing the whole case file into the window is the worst of both worlds (see "Can long-context LLMs replace retrieval-augmented generation systems?"). The escape hatch isn't a bigger window; it's grounding plus the discipline to abstain when the era is thin and the document is long.
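The grounding-plus-abstention pattern can be sketched in a few lines. This is an illustration under stated assumptions, not the cited system's implementation: the lexical-overlap scorer stands in for a real retriever, and the `REFUSAL` wording, the `min_support` threshold, and the function names are all hypothetical. It shows the two halves of the idea: retrieve broadly (a generous `k`), then either abstain outright when nothing clears the support threshold or hand the model a prompt that permits only evidence-backed answers.

```python
import re
from typing import List, Optional, Sequence, Set, Tuple

# Hypothetical refusal string; the cited system's prompt wording is not known here.
REFUSAL = "INSUFFICIENT EVIDENCE"


def tokenize(text: str) -> Set[str]:
    """Lowercased alphanumeric word set (a crude stand-in for real retrieval features)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, passages: Sequence[str], k: int) -> List[Tuple[float, str]]:
    """Return the top-k passages scored by the fraction of query words they contain."""
    q = tokenize(query)
    scored = [(len(q & tokenize(p)) / len(q) if q else 0.0, p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_grounded_prompt(query: str, passages: Sequence[str],
                          k: int = 8, min_support: float = 0.5) -> Optional[str]:
    """Build an evidence-only prompt, or return None to abstain before any model call."""
    evidence = [p for score, p in retrieve(query, passages, k) if score >= min_support]
    if not evidence:
        return None  # nothing clears the threshold: abstain rather than guess
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(evidence, 1))
    return (
        "Answer using only the numbered evidence. If the evidence does not "
        f"support an answer, reply exactly: {REFUSAL}\n\n"
        f"Evidence:\n{numbered}\n\nQuestion: {query}"
    )
```

The gate runs before generation, so a thinly covered era produces an explicit abstention instead of a fluent guess; the refusal instruction in the prompt is a second line of defense for cases where passages were retrieved but do not actually answer the question.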

Sources (7 notes)

Why do language models struggle with historical legal cases?

Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. Root cause is training corpus over-representation of recent cases, creating shallower representations of older precedent.

Why do language models fail at temporal reasoning in complex tasks?

LLMs maintain basic temporal competence in simple structured formats but generate temporally impossible relationships in long, open-ended contexts. This degradation tracks training data distribution and emerges as models rely on frequency heuristics rather than structured reasoning under complexity.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey (3.41 match)
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (2.61 match)
Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning (1.75 match)
Large Language Model Reasoning Failures (1.71 match)
Self-Guided Test-Time Training for Long-Context LLMs (1.71 match)
CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning (1.70 match)
Do LLMs Truly Understand When a Precedent Is Overruled? (1.70 match)
Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs (1.67 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a legal AI researcher stress-testing claims about LLM performance on historical cases and long documents. The question: do era sensitivity (worse performance on older precedent) and context-length reasoning degradation share a root cause, and if so, does that reshape how we should architect legal AI systems?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat all as perishable:
• Models fail on historical Supreme Court cases not because old law is harder, but because training data skews recent; thin historical representation = shallow internal models (2025-10, arXiv:2510.20941).
• Reasoning accuracy drops from 92% to 68% with only 3,000 tokens of padding—far below capacity limits—and chain-of-thought does not rescue it (2024-02, arXiv:2402.14848).
• Both era sensitivity and length degradation trace to instance-level unfamiliarity: models fit patterns seen in training and fail at novelty boundaries, making historical + long documents a compounded low-familiarity case (2025-10, arXiv:2510.18176).
• Long-context models match retrieval on semantic tasks but fail on structured relational queries; stuffing a whole case file into context is worse than hybrid retrieval (2024-06, arXiv:2406.13121).
• Grounded generation with explicit abstention (refusing answers unsupported by evidence) tolerates noisy, drifted historical sources better than raw reasoning (2025-06, arXiv:2506.09038).

Anchor papers (verify; mind their dates):
• arXiv:2510.20941 (2025-10): Do LLMs Truly Understand When a Precedent Is Overruled?
• arXiv:2402.14848 (2024-02): Same Task, More Tokens — Impact of Input Length on Reasoning
• arXiv:2406.13121 (2024-06): Can Long-Context Language Models Subsume Retrieval, RAG, SQL?
• arXiv:2506.09038 (2025-06): AbstentionBench — Reasoning LLMs Fail on Unanswerable Questions

Your task:
(1) RE-TEST THE STACKING CLAIM. For each finding above, probe whether newer models (post-2026 frontier), improved training procedures (continued pretraining on historical corpora, synthetic era-diverse data), or hybrid orchestration (dense retrieval + adaptive window sizing + confidence-gated abstention) have since decoupled or re-coupled era sensitivity and length degradation. Is the root cause (training data distribution + instance-level unfamiliarity) still holding, or has domain-specific pretraining on legal archives dissolved it? Separate the durable question (why do models fail on low-familiarity long inputs?) from the perishable constraint (thin historical representation as currently deployed).
(2) Surface the strongest CONTRADICTING work from the last ~6 months: any paper showing that long-context alone (or recent architecture changes) successfully handles historical legal reasoning, or that era sensitivity and length degradation are orthogonal, not compounding.
(3) Propose two research questions assuming the regime has moved: (a) If synthetic era-diverse pretraining or retrieval-augmented fine-tuning neutralizes the interaction, what new failure mode emerges (spurious temporal analogies? hallucinated precedent chains?)? (b) Does grounding + abstention remain the bottleneck, or is the real constraint now the cost of building drifted-source-aware evaluation benches for historical law?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
