Why do LLMs handle causal reasoning better than temporal reasoning?
Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.
From the same discourse relations study: ChatGPT shows strong performance on causal relations — outperforming fine-tuned RoBERTa on two out of three benchmarks — while struggling with temporal order between events.
The explanation offered by the researchers is that ChatGPT's difficulty with temporal tasks "could be attributed to inadequate human feedback on this feature during the model's training process." A more fundamental explanation is that causal language is pervasive and explicitly marked in text. Explanations, arguments, news articles, and scientific writing all use causal connectives ("because," "therefore," "leads to," "causes") extensively and consistently.
Temporal order, by contrast, is often implicit. We say "she went to the store and bought milk" without specifying whether the events are sequential, simultaneous, or ordered in some other way. The ordering must be inferred from context, world knowledge, and linguistic cues that are less reliable than causal connectives.
The result is a capability asymmetry that tracks training data distribution: what's frequently and explicitly marked in text, LLMs learn to handle well. What's frequently implicit, they struggle with.
This is a generalizable prediction: wherever human language uses explicit, consistent surface markers, LLMs will perform better than where the same information is conveyed implicitly. Causal > temporal is one instance of this pattern. The same logic should apply to other discourse relations, pragmatic inferences, and any semantic content that is typically left implicit in language.
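The notion of explicit surface marking can be made concrete with a rough marker count. The following is a minimal sketch, not a method from the study: the marker lists `CAUSAL_MARKERS` and `TEMPORAL_MARKERS`, the helper names, and the first example sentence are all assumptions chosen for illustration.

```python
import re
from typing import Dict, Sequence

# Hypothetical, hand-picked marker lists for illustration only;
# they are not taken from the study discussed in this note.
CAUSAL_MARKERS = ("because", "therefore", "leads to", "causes")
TEMPORAL_MARKERS = ("before", "after", "then", "while")


def count_markers(text: str, markers: Sequence[str]) -> int:
    """Count whole-word (or whole-phrase) marker occurrences, ignoring case."""
    total = 0
    for marker in markers:
        pattern = r"\b" + re.escape(marker) + r"\b"
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total


def marker_profile(text: str) -> Dict[str, int]:
    """Return how many explicit causal and temporal markers a text contains."""
    return {
        "causal": count_markers(text, CAUSAL_MARKERS),
        "temporal": count_markers(text, TEMPORAL_MARKERS),
    }


# An invented sentence with an explicit causal connective.
explicit_causal = "The road flooded because the drain was blocked."
# The note's own example: the event order is left implicit.
implicit_temporal = "She went to the store and bought milk."

print(marker_profile(explicit_causal))
print(marker_profile(implicit_temporal))
```

The second sentence carries no temporal marker at all, so a reader (or a model) has to recover the ordering from world knowledge rather than from the surface form. A count like this says nothing about model behaviour by itself; it only illustrates what "explicitly marked" means at the surface level.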
Shared biases, not just relative performance: The picture becomes more complex when LLM causal reasoning is compared not just against benchmarks but against human performance on the same tasks. "Do LLMs Reason Causally Like Us?" finds that on collider network reasoning (C1 → E ← C2), LLMs exhibit the same biases as humans: Markov violations (treating independent causes as positively correlated) and weak explaining away (once the effect is known, observing one cause lowers the probability of the other by less than is normatively warranted). LLMs are not categorically worse at causal reasoning; they err in the same direction as humans, likely because their training data was produced by humans with these same biases. See "Do large language models make the same causal reasoning mistakes as humans?"
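To see what the normative baseline for a collider looks like, the sketch below enumerates a small C1 → E ← C2 network exactly. It is an illustration under stated assumptions, not the setup of the cited paper: the noisy-OR likelihood, the priors, the causal strengths, and the leak value are all invented for the example.

```python
from itertools import product

# Assumed parameters, chosen only for illustration.
PRIOR_C1 = 0.5      # P(C1 = 1)
PRIOR_C2 = 0.5      # P(C2 = 1)
STRENGTH_C1 = 0.8   # chance that C1 alone produces E
STRENGTH_C2 = 0.8   # chance that C2 alone produces E
LEAK = 0.1          # chance that E occurs with no cause present


def p_effect(c1: int, c2: int) -> float:
    """Noisy-OR: E fails only if the leak and every present cause all fail."""
    fail = (1 - LEAK) * (1 - STRENGTH_C1) ** c1 * (1 - STRENGTH_C2) ** c2
    return 1 - fail


def joint(c1: int, c2: int, e: int) -> float:
    """Joint probability with C1 and C2 independent a priori."""
    p_c1 = PRIOR_C1 if c1 else 1 - PRIOR_C1
    p_c2 = PRIOR_C2 if c2 else 1 - PRIOR_C2
    p_e = p_effect(c1, c2) if e else 1 - p_effect(c1, c2)
    return p_c1 * p_c2 * p_e


def prob_c1(**evidence: int) -> float:
    """P(C1 = 1 | evidence), where evidence may fix 'c2' and/or 'e'."""
    numerator = 0.0
    denominator = 0.0
    for c1, c2, e in product((0, 1), repeat=3):
        state = {"c1": c1, "c2": c2, "e": e}
        if any(state[name] != value for name, value in evidence.items()):
            continue  # skip states that contradict the evidence
        p = joint(c1, c2, e)
        denominator += p
        if c1 == 1:
            numerator += p
    return numerator / denominator


# Markov condition: with E unobserved, learning C2 should not change belief in C1.
print("P(C1)            =", round(prob_c1(), 3))
print("P(C1 | C2=1)     =", round(prob_c1(c2=1), 3))

# Explaining away: once E is observed, learning C2=1 should lower belief in C1.
print("P(C1 | E=1)      =", round(prob_c1(e=1), 3))
print("P(C1 | E=1,C2=1) =", round(prob_c1(e=1, c2=1), 3))
```

Under these assumptions the first two values are equal, which is the independence that a Markov violation breaks, and the last value is lower than the third, which is the size of normative explaining away. The biases described above correspond to judging the first pair as unequal and the drop in the second pair as smaller than this calculation gives.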
Inquiring lines that use this note as a source 54
This note is a source for these synthesized inquiries.
- Can AI arguments participate in discourse without temporal grounding?
- What makes human discourse fundamentally temporal in structure?
- Can LLMs infer situational context the way humans do pragmatically?
- Why do LLMs achieve only 24 percent accuracy on implicit discourse relations?
- What is the difference between learning discourse patterns and learning abstract language?
- What makes relational structure sufficient for generating contextually appropriate discourse?
- What architectural features enable counterfactual reasoning in world models?
- Why do causal graphs alone fail to capture human reasoning processes?
- Do language models exhibit the same causal biases that humans show?
- How does the location of causal passages differ between news and lectures?
- Can causal models be extended to include non-causal cognition?
- What role do time intervals play in shaping conversation responses?
- How should temporal metadata indexing differ from semantic indexing?
- Should LLM reasoning be studied as latent state trajectories rather than surface text?
- Can explicit connectives compensate for missing intentional tracking in LLMs?
- How do the four discourse relations differ in their connection to anxiety?
- Why do large language models fail at temporal reasoning in complex legal cases?
- Why do language models fail at implicit discourse relations while handling explicit connectives?
- Why do explicit discourse connectives help LLMs but implicit relations cause failures?
- Why do temporal reasoning patterns matter more than final answers?
- Can language models distinguish explicit from implicit discourse relations?
- Can language models develop world models that ground meaning in causal reality?
- How does context complexity affect LLM performance on temporal reasoning tasks?
- Why do LLMs inherit causal biases from their training data?
- Do LLMs rely on surface statistical patterns instead of causal structure?
- Why do LLMs perform better on explicit discourse connectives than implicit relations?
- Why does LLM compression eliminate causal grounding in conceptual representations?
- Why do explicit discourse connectives work when implicit relations fail?
- How does temporal event structure scaffold coherence in dialogue?
- Do language models consistently produce anachronistic output about historical periods?
- Should time always be a first-class ranking signal in temporally-extended sources?
- How do discourse relation types improve dialogue beyond sentence-level semantic matching?
- Can causal belief networks extracted from interviews predict how people respond to policy changes?
- Can functional semantic grounding substitute for true causal grounding?
- Why do causal reasoning directions succeed while temporal reasoning directions fail?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- Why do longer context windows alone fail to capture temporal dynamics in dialogue?
- Can LLMs reason through semantics without understanding causal mechanisms?
- How do causal belief networks extracted from interviews enable intervention reasoning?
- How does semantic association differ from mechanistic causal reasoning?
- What makes a causal abstraction more transferable than a generic heuristic?
- Can external actions provide causal necessity that language models lack?
- What real-world forecasting domains benefit most from contextual reasoning integration?
- Why do LLMs reason fluently about causality but lack causal rigor?
- How can extracted causal belief networks enable intervention simulation?
- Why do language models need external temporal signals at all?
- Can time-awareness live in model parameters instead of retrieval?
- How does temporal grounding in retrieval compare to architectural approaches?
- How does time-partitioned routing compare to retrieval-augmented temporal grounding?
- What is the accuracy cost of enforcing temporal causality inside model parameters?
- Can modular expert decomposition extend beyond time into other causal dimensions?
- Why does token ordering in LLMs create sequences rather than true temporal flow?
- What architectural changes would help LLMs distinguish causal relationships from temporal sequences?
- Do LLMs show stronger reasoning about causality than about temporal ordering?
Related concepts in this collection 2
- Why does ChatGPT fail at implicit discourse relations?
  ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
  Relation to this note: the same training-data-surface-distribution pattern at the discourse relation level.
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Relation to this note: structural parallel, in that surface regularity drives performance.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Exploring the Potential of ChatGPT on Sentence Level Relations: A Focus on Temporal, Causal, and Discourse Relations
- Do Large Language Models Reason Causally Like Us? Even Better?
- Causal Reflection with Language Models
- Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning
- Premise Order Matters in Reasoning with Large Language Models
- Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Mitigating Hallucinations in Large Language Models via Causal Reasoning
Original note title
causal reasoning is stronger than temporal reasoning in llms because causal patterns dominate training data