INQUIRING LINE

Why do large language models fail at temporal reasoning in complex legal cases?

This explores why LLMs stumble on time-ordering in hard legal cases — and the corpus suggests it's not really about law at all, but about how these models handle time, complexity, and unfamiliar material everywhere.


The interesting thing is that the corpus treats "legal" as almost incidental. The failure is really three separate weaknesses stacked on top of each other.

The first is about time itself. LLMs are decent at causal reasoning ("X caused Y") because causal connectives are spelled out explicitly and often in training text, but temporal order is usually implicit and has to be inferred from context, so it's the weaker muscle from the start (Why do LLMs handle causal reasoning better than temporal reasoning?). There's even an argument that the model has no real sense of time at all: token generation is sequential but atemporal, ordering tokens by probability with no duration or reflection between them (Does AI text generation unfold through temporal reflection?).

The second weakness is complexity, and here the corpus is sharp: models keep basic temporal competence in short, structured prompts but start producing temporally impossible relationships in long, open-ended ones, falling back on frequency heuristics instead of structured reasoning as the input gets messier (Why do language models fail at temporal reasoning in complex tasks?). And "long" arrives sooner than you'd think: reasoning accuracy can drop from 92% to 68% with only about 3,000 tokens of padding, far below the context window limit, even with chain-of-thought prompting (Does reasoning ability actually degrade with longer inputs?). A complex legal case is exactly this: a long, tangled, multi-party timeline.
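To make "temporally impossible relationships" concrete, here is a minimal sketch of one way to check a set of extracted "A happened before B" claims for consistency. It is an illustration only, not a method taken from the cited notes: the function name `order_events` and the event labels are hypothetical, and it assumes the claims have already been pulled out of the model's answer as (earlier, later) pairs. A cycle in those pairs means no real timeline can satisfy them.

```python
from graphlib import TopologicalSorter, CycleError


def order_events(before_claims):
    """Return one ordering consistent with the claims, or None if the
    claims are temporally impossible (e.g. A<B, B<C, C<A)."""
    sorter = TopologicalSorter()
    for earlier, later in before_claims:
        # add(node, *predecessors): `earlier` must come before `later`.
        sorter.add(later, earlier)
    try:
        # static_order() raises CycleError if the claims contain a cycle.
        return list(sorter.static_order())
    except CycleError:
        return None


# Hypothetical claims extracted from a model's summary of a case.
consistent = [
    ("contract signed", "breach"),
    ("breach", "complaint filed"),
]
# Adding this claim closes a loop, so no valid timeline exists.
impossible = consistent + [("complaint filed", "contract signed")]

print(order_events(consistent))
print(order_events(impossible))
```

A check like this only catches contradictions among the stated relations; it says nothing about whether a consistent ordering is the correct one.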

The third weakness is specific to law as a domain. Models perform measurably worse on historical legal cases than modern ones, because training corpora over-represent recent material and form shallower representations of older precedent (Why do language models struggle with historical legal cases?). That matters for temporal reasoning in law because legal time-ordering often hinges on which precedent came first and whether a later case overruled an earlier one, which is exactly the older material the model knows least well.
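The ordering constraint behind overruling is simple to state: a case can only overrule one decided before it. As a rough sketch, assuming decision dates are available from a trusted source, a model's claimed overruling pairs could be screened like this. The `Case` type, the function `invalid_overrulings`, and both case names and dates are hypothetical placeholders, not entries from the benchmark mentioned above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Case:
    name: str
    decided: date


def invalid_overrulings(pairs):
    """Return the (overruling, overruled) pairs in which the overruling
    case was not decided strictly after the case it supposedly overruled."""
    return [(a, b) for a, b in pairs if a.decided <= b.decided]


# Hypothetical cases with made-up decision dates.
older = Case("Hypothetical v. Older", date(1923, 4, 9))
newer = Case("Hypothetical v. Newer", date(1937, 3, 29))

# Two claims a model might produce: the second reverses the order.
claims = [(newer, older), (older, newer)]
bad = invalid_overrulings(claims)
print([(a.name, b.name) for a, b in bad])
```

This screens out only chronologically impossible claims; whether the later case actually overruled the earlier one still has to be verified against the opinions themselves.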

Here's the part you might not expect: the deeper cause may not be "complexity" but unfamiliarity. One line of work argues that reasoning models don't break at some complexity threshold. They break at instance-novelty boundaries, fitting patterns from similar examples they've seen rather than running a general algorithm (Do language models fail at reasoning due to complexity or novelty?). A novel legal fact pattern with an unusual timeline is novel twice over. Related work reframes some "reasoning" collapses as execution failures: the model knows the procedure but can't carry out many steps in pure text, and does better when given tools (Are reasoning model collapses really failures of reasoning?). And the unsettling "potemkin" pattern shows models can correctly explain a concept, fail to apply it, and recognize the failure, with explanation and execution running on disconnected tracks (Can LLMs understand concepts they cannot apply?). So a model can recite the rule for ordering precedents and still get the order wrong.
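If execution rather than knowledge is the bottleneck, one plausible mitigation is to hand the many-step part to a deterministic tool. The sketch below is an assumption about what such a tool might look like, not something described in the cited notes: a hypothetical `sort_timeline` function that takes dated events as JSON (the format a model might emit in a tool call) and returns the labels in chronological order, so the model never has to sort in free text.

```python
import json
from datetime import date


def sort_timeline(events_json: str) -> str:
    """Sort events by ISO date and return their labels as a JSON list.

    Expects a JSON array of objects with "label" and "date" (YYYY-MM-DD)
    keys; this input shape is an assumption made for the example.
    """
    events = json.loads(events_json)
    # sorted() is stable, so same-day events keep their input order.
    ordered = sorted(events, key=lambda e: date.fromisoformat(e["date"]))
    return json.dumps([e["label"] for e in ordered])


# Hypothetical events, deliberately given out of order.
payload = json.dumps([
    {"label": "appeal filed", "date": "2021-03-02"},
    {"label": "trial verdict", "date": "2020-11-17"},
    {"label": "contract signed", "date": "2019-06-30"},
])
result = sort_timeline(payload)
print(result)
```

The tool only helps once dates have been extracted correctly; an event whose date the model misreads will still be sorted into the wrong place.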

The through-line, drawn across all these: this isn't a legal-knowledge gap you fix with more case law. It's the predictable behavior of a probability machine that handles implicit relations worse than explicit ones, degrades with length, and leans on familiarity over algorithm. The pattern is general enough that researchers can forecast where it'll appear from the model's autoregressive nature alone (Can we predict where language models will fail?).


Sources (9 notes)

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Does AI text generation unfold through temporal reflection?

Token ordering in LLMs follows probabilistic selection without intervening reflection or revision. Human discourse gains meaning from temporal structure—time spent thinking changes what comes next—but AI text production lacks this duration-in-reflection despite appearing sequentially composed.

Why do language models fail at temporal reasoning in complex tasks?

LLMs maintain basic temporal competence in simple structured formats but generate temporally impossible relationships in long, open-ended contexts. This degradation tracks training data distribution and emerges as models rely on frequency heuristics rather than structured reasoning under complexity.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than modern ones. The root cause is that training corpora over-represent recent cases, creating shallower representations of older precedent.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds if the model was trained on similar instances.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as reciting the alphabet backwards and counting letters.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst probing whether recent LLM advances have relaxed or overturned constraints on temporal reasoning in legal contexts. The question remains: why do LLMs fail at time-ordering in complex legal cases, and has that failure mode shifted?

What a curated library found — and when (dated claims, not current truth):

Findings span 2023–2026; treat them as perishable:

• Causal reasoning outperforms temporal reasoning in LLMs because causal connectives are explicit in text; temporal order must be inferred, forming a weaker capability (~2025).
• Reasoning accuracy drops from 92% to 68% with just a few thousand tokens of input padding, far below context-window limits, even with chain-of-thought (~2024).
• Models perform measurably worse on historical legal cases than modern ones due to training-corpus recency bias and shallower precedent representations (~2024–2025).
• Reasoning breakdown correlates with instance-level unfamiliarity (novel fact patterns), not abstract task complexity; execution failures (inability to chain steps in text) often masquerade as reasoning failures (~2025).
• Models exhibit "Potemkin understanding": correct explanation + failed application + recognized failure, suggesting disconnected execution tracks (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2402.14848 (2024-02): Input length degrades reasoning.
• arXiv:2502.10215 (2025-02): Causal vs. temporal reasoning contrast.
• arXiv:2510.20941 (2025-10): Precedent overruling competence in LLMs.
• arXiv:2507.10624 (2025-07): Architectural limits in symbolic tasks.

Your task:

(1) RE-TEST EACH CONSTRAINT. For each finding above, assess whether newer model architectures (reasoning-native, verifier-augmented, or hybrid symbolic–neural), retrieval-augmented generation (RAG) over legal databases, multi-step tool use (theorem provers, temporal databases), or post-training refinement have since relaxed or dissolved it. Distinguish the durable problem (temporal reasoning *per se*) from the perishable limitation (e.g., context degradation). Cite what resolved it; state plainly where the constraint still binds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look especially for: (a) legal LLM systems that succeed at precedent time-ordering; (b) evidence that instance-novelty is not the true blocker; (c) claims that execution failures have been solved by tooling or architecture; (d) reframes of "Potemkin understanding" as solvable via training or prompting.

(3) Propose 2 research questions that ASSUME the regime may have moved: one that assumes temporal reasoning in law is now a *tool-access* or *architecture* problem (not a fundamental capability gap), and one that asks whether time-order failures in law are actually *retrieval* failures (models don't find the right precedent) masquerading as reasoning failures.

Cite arXiv IDs; flag anything you cannot ground in a real paper.
