INQUIRING LINE

Why does hypothesis attestation bias exist separately from frequency bias in NLI?

This explores why LLMs doing natural language inference (NLI) seem to judge whether a conclusion follows not by checking the logic, but by recalling whether they've 'seen' that conclusion before — and why that memorization habit looks like its own distinct failure rather than just a side effect of common phrases showing up a lot.


Attestation bias is the habit of calling something 'entailed' simply because the hypothesis looks familiar from training, and in NLI it behaves like a separate failure from raw frequency bias. The cleanest evidence comes from McKenna et al. (see "Do LLMs predict entailment based on what they memorized?"): when you swap in a *random* premise that has nothing to do with the hypothesis, models still predict entailment as long as the hypothesis itself appears 'attested' in training data. That random-premise trick is the tell. If the bias were purely about frequency, you'd expect it to track how often a phrasing occurs; instead it tracks whether the proposition was *encountered as true*. The model is answering 'does this sound like something I've learned?' rather than 'does the premise support this?' The premise-hypothesis relationship, which is the entire point of inference, drops out.
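A minimal sketch of how a random-premise probe could be scored, assuming each example already carries an attestation label. The names `random_premise_rates` and `toy_predict`, the example sentences, and the labels are all hypothetical illustrations, not McKenna et al.'s code or data. The toy predictor deliberately ignores the premise, to show what a purely attestation-driven model would look like.

```python
import random

def random_premise_rates(examples, predict, seed=0):
    """Entailment rate under random premises, split by attestation label.

    examples: list of (premise, hypothesis, attested) tuples (hypothetical format).
    predict:  callable (premise, hypothesis) -> bool, True meaning 'entailed'.
    """
    rng = random.Random(seed)
    premises = [p for p, _, _ in examples]
    counts = {True: [0, 0], False: [0, 0]}  # attested -> [entailed, total]
    for premise, hypothesis, attested in examples:
        # Swap in a premise taken from a different example.
        random_premise = rng.choice([p for p in premises if p != premise])
        counts[attested][0] += int(predict(random_premise, hypothesis))
        counts[attested][1] += 1
    return {label: hits / total for label, (hits, total) in counts.items() if total}


# Toy stand-in for a model: it ignores the premise entirely and answers
# from a made-up set of 'memorized' propositions.
MEMORIZED = {"Paris is in France", "Water is wet"}

def toy_predict(premise, hypothesis):
    return hypothesis in MEMORIZED

examples = [
    ("The cat sat on the mat", "Paris is in France", True),
    ("Anna bought a bicycle", "Water is wet", True),
    ("The meeting ran late", "Bob owns a green kite", False),
    ("It rained on Tuesday", "Carla moved to Lisbon", False),
]

rates = random_premise_rates(examples, toy_predict)
print(rates)  # {True: 1.0, False: 0.0}
```

A model that actually used the premise should score near zero in both rows here, since an unrelated premise supports nothing. A large gap between the attested and unattested rows is the signature the random-premise experiment looks for.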

Why would memorized truth and surface frequency come apart? Because the corpus suggests LLMs don't reason over logical form at all; they reason over meaning. When semantic content is stripped out and only the rules remain, performance collapses (see "Do large language models reason symbolically or semantically?"). So inference isn't a structural operation that frequency merely nudges; it's a semantic-association lookup. Attestation is what that lookup retrieves: a stored judgment about a specific proposition's truth, which is a different object from how common its words are.

The deeper reason these biases live in different places is architectural. Work on content effects shows that for transformers, semantic content and logical form aren't separable channels: models reproduce human belief-bias signatures item-by-item across NLI, syllogisms, and Wason tasks (see "Do language models show the same content effects humans do?"). Believability of the conclusion and validity of the argument are entangled in the same representation. Attestation bias is essentially believability bias with a memory address: 'I have this proposition filed as true' overrides 'the premise in front of me doesn't license it.' That's also why prompting alone rarely fixes it: when prior training associations are strong, parametric knowledge dominates the actual context, and only intervening in the representations shifts the behavior (see "Why do language models ignore information in their context?").

There's a useful provenance clue too: these tendencies are largely planted in pretraining, not instruction tuning. Models sharing a pretrained backbone show similar bias patterns regardless of finetuning data; finetuning only modulates what pretraining installed (see "Where do cognitive biases in language models come from?"). So attestation bias isn't a tuning artifact you can RLHF away; it's baked into what the model learned propositions *are*. And it sits alongside a family of related 'looks-right' shortcuts the corpus documents: models defaulting to conservative answers that mimic reasoning (see "Are models actually reasoning about constraints or just defaulting conservatively?"), or agreeing with claims they can independently verify as false (see "Why do language models accept false assumptions they know are wrong?"). The common thread: the model substitutes a familiarity or social signal for the actual inferential work.

The thing worth walking away with: 'attested' and 'frequent' are not the same coordinate. Frequency is about how often word-strings appear; attestation is about which propositions got stored as true. NLI tests the relationship *between* two statements — and a system that retrieves truth-by-memory will quietly answer a different question than the one being asked, even when its training corpus is perfectly balanced on frequency.
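One way to see that these are separate coordinates is to hold frequency fixed and ask whether attestation still moves the prediction. The sketch below does that on made-up records. The `attested` and `frequent` flags, the function names, and every value are assumptions for illustration, not measurements from any paper.

```python
from collections import defaultdict

def stratified_rates(records):
    """Entailment rate per (attested, frequent) cell.

    records: iterable of (attested, frequent, predicted_entailed) booleans.
    """
    cells = defaultdict(lambda: [0, 0])  # cell -> [entailed, total]
    for attested, frequent, entailed in records:
        cells[(attested, frequent)][0] += int(entailed)
        cells[(attested, frequent)][1] += 1
    return {cell: hits / total for cell, (hits, total) in cells.items()}

def attestation_gap(rates, frequent):
    """Attested minus unattested entailment rate within one frequency bin."""
    return rates[(True, frequent)] - rates[(False, frequent)]

# Made-up predictions under random premises, two records per cell.
records = [
    (True, True, True), (True, True, True),
    (True, False, True), (True, False, True),
    (False, True, True), (False, True, False),
    (False, False, False), (False, False, False),
]

rates = stratified_rates(records)
print(attestation_gap(rates, frequent=True))   # 0.5
print(attestation_gap(rates, frequent=False))  # 1.0
```

A nonzero gap inside a fixed frequency bin is what 'separate from frequency' would look like in data. With real models both flags would have to be estimated somehow, and how to do that is outside this sketch.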


Sources (7 notes)

Do LLMs predict entailment based on what they memorized?

McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do language models show the same content effects humans do?

LLMs show the same content-sensitivity patterns as humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Why do language models accept false assumptions they know are wrong?

The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an NLI researcher tasked with re-evaluating whether hypothesis attestation bias and frequency bias remain separable failure modes in modern LLMs, or whether recent capability gains, architectural changes, or training methods have collapsed or reconciled them.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat as perishable.
- Attestation bias (models calling 'entailed' based on hypothesis familiarity alone, ignoring premise) is empirically separable from frequency bias via random-premise swaps; models still predict entailment when premises are semantically null (2024–2025, McKenna et al. framework).
- LLMs reason over semantic content, not logical form; when semantics are stripped, NLI performance collapses — inference is a meaning-lookup, not a structural operation (~2023).
- Believability and logical validity are entangled in the same transformer representation; attestation bias is 'believability bias with a memory address' that survives in-context prompting when prior training associations are strong (2024–2025).
- These biases are largely planted in pretraining, not instruction tuning; finetuning only modulates pretraining patterns, rarely ablates attestation bias (~2025).
- Surface heuristics (conservative answers, 'looks-right' shortcuts) override implicit constraints even in sophisticated reasoning tasks (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2207.07051 (2022): Content effects on reasoning tasks
- arXiv:2305.14825 (2023): Semantic vs. symbolic reasoning in LLMs
- arXiv:2507.07186 (2025): Cognitive biases rooted in pretraining
- arXiv:2603.29025 (2026): Surface heuristics override implicit constraints

Your task:
(1) RE-TEST EACH CONSTRAINT. For the five findings above, probe whether (a) newer NLI datasets/benchmarks with adversarial hypothesis-premise pairing have shifted the attestation baseline, (b) post-training interventions (DPO, reward modeling, constitutional AI) targeting calibration or reasoning have actually decoupled attestation from believability, (c) scaling or architectural changes (MoE, retrieval-augmented generation, multi-stage reasoning) have introduced a separate logical-form pathway that bypasses meaning-lookup. Plainly state which constraints still hold and which may have shifted; cite what shifted them.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: has any study shown attestation and frequency *reconverge*, or shown that in-context prompting + chain-of-thought *does* reliably override parametric attestation?
(3) Propose 2 research questions assuming the regime may have moved: (a) Can fine-grained attribution methods (e.g., logit lens, saliency) now distinguish 'premise is being processed' from 'hypothesis believability is dominating' in real time? (b) Do multimodal or grounded models show weaker attestation bias because non-linguistic modalities force genuine premise integration?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
