SYNTHESIS NOTE

Why do LLMs handle causal reasoning better than temporal reasoning?

Exploring whether language models perform asymmetrically on different discourse relations and what training data patterns might explain the gap between causal and temporal reasoning abilities.

Synthesis note · 2026-02-21 · sourced from Discourses

From the same discourse relations study: ChatGPT shows strong performance on causal relations — outperforming fine-tuned RoBERTa on two out of three benchmarks — while struggling with temporal order between events.

The researchers' own explanation is that the difficulty with temporal order "could be attributed to inadequate human feedback on this feature during the model's training process." More fundamentally, causal language is pervasive and explicitly marked in text. Explanations, arguments, news articles, and scientific writing all use causal connectives ("because," "therefore," "leads to," "causes") extensively and consistently.

Temporal order, by contrast, is often implicit. We say "she went to the store and bought milk" without specifying whether the events are sequential, simultaneous, or ordered in some other way. The ordering must be inferred from context, world knowledge, and linguistic cues that are less reliable than causal connectives.
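
The contrast between explicit and implicit marking can be made concrete with a small sketch. It is only an illustration, not the method of the study: the causal markers are the ones listed in this note, while the temporal marker list and the function name explicit_markers are assumptions introduced here.

```python
import re

# Causal markers are taken from this note. The temporal markers are an
# assumption added for the sketch; they are not from the cited study.
CAUSAL_MARKERS = ("because", "therefore", "leads to", "causes")
TEMPORAL_MARKERS = ("before", "after", "then", "while", "afterwards")


def explicit_markers(sentence):
    """Return the explicit causal and temporal markers found in a sentence."""
    text = sentence.lower()

    def found(markers):
        # Match whole words or phrases only, so "then" does not match "strengthen".
        return [m for m in markers if re.search(rf"\b{re.escape(m)}\b", text)]

    return {"causal": found(CAUSAL_MARKERS), "temporal": found(TEMPORAL_MARKERS)}


print(explicit_markers("The road flooded because it rained."))
print(explicit_markers("She went to the store and bought milk."))
```

The second sentence carries no surface marker of either kind, so a reader (or a model) has to recover the ordering from world knowledge alone. A surface check like this says nothing about how a model actually processes the sentence; it only shows what is and is not visible in the text.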

The result is a capability asymmetry that tracks training data distribution: what's frequently and explicitly marked in text, LLMs learn to handle well. What's frequently implicit, they struggle with.

This is a generalizable prediction: wherever human language uses explicit, consistent surface markers, LLMs will perform better than where the same information is conveyed implicitly. Causal > temporal is one instance of this pattern. The same logic should apply to other discourse relations, pragmatic inferences, and any semantic content that is typically left implicit in language.

Shared biases, not just relative performance: The picture becomes more complex when LLM causal reasoning is compared not only against benchmarks but against human performance on the same tasks. "Do LLMs Reason Causally Like Us?" finds that on collider network reasoning (C1 → E ← C2), LLMs exhibit the same biases as humans: Markov violations (treating independent causes as positively correlated) and weak explaining away (observing one cause reduces the probability of the other by less than is normatively warranted). LLMs are not categorically worse at causal reasoning. They err in the same direction as humans, likely because the training data was produced by humans with these same biases. See the related note "Do large language models make the same causal reasoning mistakes as humans?"
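
To make the normative baseline for the collider case concrete, here is a minimal sketch that enumerates the joint distribution of C1 → E ← C2 and reads off the relevant conditionals. The noisy-OR likelihood, the priors of 0.5, and the causal strengths of 0.8 are all assumptions chosen for illustration; they are not parameters from the cited paper.

```python
from itertools import product


def collider_joint(p_c1, p_c2, w1, w2):
    """Joint distribution over (C1, C2, E) for a collider C1 -> E <- C2.

    Assumes independent causes and a noisy-OR effect with no background cause.
    """
    joint = {}
    for c1, c2, e in product((0, 1), repeat=3):
        prior = (p_c1 if c1 else 1 - p_c1) * (p_c2 if c2 else 1 - p_c2)
        p_e = 1 - (1 - w1) ** c1 * (1 - w2) ** c2  # noisy-OR: P(E=1 | C1, C2)
        joint[(c1, c2, e)] = prior * (p_e if e else 1 - p_e)
    return joint


def prob_c1(joint, **evidence):
    """P(C1 = 1 | evidence), where evidence may fix 'c2' and/or 'e'."""
    def consistent(c2, e):
        return all({"c2": c2, "e": e}[k] == v for k, v in evidence.items())

    num = sum(p for (c1, c2, e), p in joint.items() if c1 == 1 and consistent(c2, e))
    den = sum(p for (c1, c2, e), p in joint.items() if consistent(c2, e))
    return num / den


joint = collider_joint(p_c1=0.5, p_c2=0.5, w1=0.8, w2=0.8)
print(prob_c1(joint))             # prior on C1
print(prob_c1(joint, c2=1))       # independence: unchanged by C2 alone
print(prob_c1(joint, e=1))        # effect observed: C1 becomes more likely
print(prob_c1(joint, e=1, c2=1))  # explaining away: C2 accounts for E
```

Under these assumed parameters, the normative answers are 0.5, 0.5, 0.6875, and roughly 0.545. A Markov violation corresponds to judging the second quantity above the first, and weak explaining away corresponds to a drop from the third to the fourth that is smaller than the one computed here.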

Inquiring lines that read this note 55

This note is a source for these research framings.

Does conversational format create illusions of genuine AI communication?

Can AI arguments participate in discourse without temporal grounding?

How do formal dialogue structures reveal conversation coherence mechanisms?

How do language models establish social grounding in human dialogue?

Why do language models struggle with implicit discourse relations?

Do language models understand semantics or rely on pattern matching?

What is the difference between learning discourse patterns and learning abstract language?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

How do language models inherit human biases from training data?

What factors beyond surface content determine how readers extract meaning differently?

How does the location of causal passages differ between news and lectures?

How should dialogue recommender systems manage conversation history and state?

How should retrieval systems optimize for multi-step reasoning during inference?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How can emotions function as reliable information in reasoning and cognitive systems?

How do the four discourse relations differ in their connection to anxiety?

Do language models learn genuine linguistic structure or just surface patterns?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Do language models develop causal world models or rely on statistical patterns?

What structural factors drive popularity bias in recommendation systems?

Should time always be a first-class ranking signal in temporally-extended sources?

What capability tradeoffs emerge when scaling model reasoning abilities?

Why do foundation models develop task-specific heuristics instead of causal understanding?

How does reasoning graph topology affect breakthrough insights and generalization?

What makes a causal abstraction more transferable than a generic heuristic?

How do training data properties shape reasoning capability development?

What real-world forecasting domains benefit most from contextual reasoning integration?

Why does finetuning cause catastrophic forgetting of model capabilities?

Can time-awareness live in model parameters instead of retrieval?

How can identical external performance mask different internal representations?

What is the accuracy cost of enforcing temporal causality inside model parameters?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Can modular expert decomposition extend beyond time into other causal dimensions?

How do prompt structure and constraints affect model instruction reliability?

Why does token ordering in LLMs create sequences rather than true temporal flow?

Related concepts in this collection 2

Why does ChatGPT fail at implicit discourse relations? ChatGPT excels when discourse connectives are present but drops to 24% accuracy without them. What does this gap reveal about how LLMs actually process meaning and logical relationships?
the same training-data-surface-distribution pattern at the discourse relation level
Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
structural parallel: surface regularity drives performance

Original note title

causal reasoning is stronger than temporal reasoning in llms because causal patterns dominate training data