INQUIRING LINE

Why do LLMs inherit causal biases from their training data?

This line of inquiry asks where LLM causal-reasoning biases actually come from. The collected notes point overwhelmingly at the training corpus itself, not at the reasoning machinery built on top of it.


The short version of why LLMs end up with skewed causal reasoning is that they absorb the statistical shape of their training text rather than failing at logic per se. When models are tested on classic causal puzzles (collider networks, where you have to 'explain away' one cause given another), they make the *same* mistakes humans make, showing weak explaining-away and Markov violations item-for-item (see: Do large language models make the same causal reasoning mistakes as humans?). That mirroring is the tell: if the errors match human errors this closely, the likely source is the human-generated text the models were trained on, not some unique machine defect.
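To make 'explaining away' concrete, here is a minimal sketch of the normative calculation for a collider network C1 -> E <- C2. It assumes a noisy-OR effect with no background leak, and the numbers in `PRIOR` and `STRENGTH` are hypothetical values picked for illustration, not parameters from any of the cited studies.

```python
from itertools import product

# Hypothetical parameters chosen only for illustration.
PRIOR = 0.3      # P(C1 = 1) = P(C2 = 1); the two causes are independent a priori
STRENGTH = 0.8   # probability that a present cause produces the effect on its own


def joint(c1, c2, e):
    """P(C1=c1, C2=c2, E=e) for the collider C1 -> E <- C2 with a noisy-OR effect."""
    p_causes = (PRIOR if c1 else 1 - PRIOR) * (PRIOR if c2 else 1 - PRIOR)
    p_effect = 1 - (1 - STRENGTH) ** (c1 + c2)  # noisy-OR, no background leak
    return p_causes * (p_effect if e else 1 - p_effect)


def posterior_c1(e, c2=None):
    """P(C1=1 | E=e), optionally also conditioning on an observed value of C2."""
    num = den = 0.0
    for c1, c2_val in product((0, 1), repeat=2):
        if c2 is not None and c2_val != c2:
            continue
        p = joint(c1, c2_val, e)
        den += p
        num += p * c1
    return num / den


effect_only = posterior_c1(e=1)        # about 0.60
with_rival = posterior_c1(e=1, c2=1)   # about 0.34
print(f"P(C1 | E)     = {effect_only:.3f}")
print(f"P(C1 | E, C2) = {with_rival:.3f}")
```

Under these assumptions, a normative reasoner drops its belief in C1 from roughly 0.60 to roughly 0.34 once the rival cause C2 is known to be present. 'Weak explaining-away' means the observed drop, in humans or in models, is smaller than this calculation predicts.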

The most direct evidence on *where* the bias gets planted comes from a causal experiment that varied random seeds and cross-tuned models: biases are baked in during pretraining, and finetuning only nudges them around the edges (see: Where do cognitive biases in language models come from?). So the instinct to fix bias with better instruction data is fighting the wrong layer. This shows up concretely in odd places: LLM recommenders inherit popularity bias straight from the pretraining corpus, with GPT-4 fixating on items frequent in its training text (The Shawshank Redemption everywhere) regardless of the target dataset's actual popularity (see: Where does LLM recommendation bias actually come from?; Where do recommendation biases come from in language models?).

There's a deeper mechanism worth knowing: these models reason through *semantic association*, not symbolic logic. When you decouple meaning from the rules of a task, performance collapses even when the correct rule is sitting right there in context (see: Do large language models reason symbolically or semantically?). Because reasoning rides on learned content associations, the model can't help importing the world's correlational structure, and content effects (believing a conclusion because it *sounds* true) reproduce human belief-bias signatures exactly (see: Do language models show the same content effects humans do?). Causal reasoning is also unevenly strong: models handle it *better* than temporal reasoning precisely because causal connectives ('because', 'therefore') are explicit and frequent in text, while temporal order is usually left implicit (see: Why do LLMs handle causal reasoning better than temporal reasoning?). The bias, in other words, is a fingerprint of what the corpus made easy to learn.
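One way to picture 'decoupling meaning from the rules of a task' is to swap the content words of a problem for opaque symbols while leaving its logical form untouched. The sketch below is only an illustration of that idea: the helper `decouple`, the `symN` naming scheme, and the example sentence are all hypothetical, and this is not the procedure used in the cited work.

```python
import re


def decouple(text, content_words):
    """Replace each content word with an opaque symbol, consistently across the text.

    The logical structure (if/then, argument order) is preserved; only the
    familiar meaning of the listed words is removed.
    """
    mapping = {word: f"sym{i}" for i, word in enumerate(content_words, start=1)}
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping


# Hypothetical task text: one rule followed by one fact.
original = "If X is a parent of Y then X is older than Y. alice is a parent of bob."
symbolic, mapping = decouple(original, ["parent", "older"])
print(symbolic)
# If X is a sym1 of Y then X is sym2 than Y. alice is a sym1 of bob.
```

Both versions license the same inference by the same rule. A purely symbolic reasoner would treat them identically, so a gap in accuracy between the two is the kind of signal that points to reliance on semantic association.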

Here's the part you might not expect: a lot of what looks like 'reasoning bias' is really *distributional* bias. Framed as autoregressive probability machines, LLMs predictably fail on tasks with low-probability targets even when those tasks are logically trivial, such as reciting the alphabet backwards or counting letters (see: Can we predict where language models will fail?). The same gravity explains era-sensitivity in legal reasoning: models do worse on historical cases because recent cases dominate the corpus, giving older precedent shallower representations (see: Why do language models struggle with historical legal cases?). And the bias isn't only over content; it's over *self-knowledge* too: when you ask a model why it did something, the answer mostly echoes training-data distributions rather than any real introspection (see: Can language models actually introspect about their own states?).

The through-line: LLMs inherit causal biases because the corpus *is* the prior. They don't reason over a clean causal model of the world — they reproduce the correlational and frequency structure of the text they were trained on, and our biases are written all over that text. The practical upshot is that mitigation has to target pretraining and distribution, not the polish layer of finetuning or prompting.


Sources (10 notes)

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Where does LLM recommendation bias actually come from?

GPT-4 concentrates recommendations on items popular in its pretraining corpus rather than in target datasets. The Shawshank Redemption dominates across different datasets even when they have different popularity distributions, revealing a domain-shift effect that standard debiasing methods cannot address.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models show the same content effects humans do?

LLMs show identical content-sensitivity patterns to humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Why do language models struggle with historical legal cases?

A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than on modern ones. The root cause is the training corpus's over-representation of recent cases, which creates shallower representations of older precedent.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, probe whether the claim 'LLMs inherit causal biases because they absorb training-data distributions, not because of reasoning defects' still holds as a unified explanation, or whether recent work has fractured it into domain-specific or model-capability-dependent regimes.

What a curated library found — and when (findings span 2022–2025; these are dated claims, not current truth):
• LLMs exhibit human-like causal biases (weak explaining-away, Markov violations) mirroring human errors from training text (~2025).
• Cognitive biases are planted in pretraining; finetuning only nudges them at the margins (~2025).
• Models reason via semantic association, not symbolic logic; when semantics decouple from task rules, performance collapses (~2023).
• Causal reasoning outperforms temporal reasoning because causal connectives are explicit in text; temporal order is implicit (~2024).
• Era-sensitivity in legal reasoning: older precedents underrepresented in corpus, making models weaker on historical cases (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023): In-context semantic reasoners, not symbolic reasoners.
• arXiv:2502.10215 (2025): Do LLMs reason causally like us? Even better?
• arXiv:2507.07186 (2025): Planted in pretraining, swayed by finetuning — origins of cognitive bias.
• arXiv:2510.20941 (2025): Do LLMs truly understand when a precedent is overruled?

Your task:
(1) RE-TEST the distribution-as-prior thesis. Has scaled pretraining, synthetic data injection, constitutional AI, or mechanistic interpretability since shown that LLMs can *decouple* from corpus statistics during reasoning? Separate the durable claim (causal reasoning rides on learned associations) from the perishable one (biases are immutable without retraining). Where does the constraint still hold?
(2) Surface work from the last 6 months that *contradicts* the pretraining-is-primary finding — e.g., evidence that finetuning, in-context steering, or test-time intervention meaningfully *rewires* causal reasoning, not just nudges it.
(3) Propose two research questions that assume models *can* develop causal independence from corpus bias: (a) Under what training or architectural conditions does a model break the semantic-association ceiling? (b) Can post-hoc causal fine-tuning or mechanistic repair reduce era-sensitivity or popularity bias without full retraining?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
