INQUIRING LINE

Does fine-tuning on NLI tasks reduce or amplify frequency bias?

This explores whether training a model specifically on natural language inference (NLI) — the task of judging whether one sentence implies another — actually teaches reasoning, or whether it just sharpens a shortcut the model already had.


The corpus answers clearly: NLI fine-tuning amplifies frequency bias rather than reducing it. The core finding is that fine-tuning makes a model lean harder on how often words appear in its training corpus (hypernyms like 'animal' are more common than hyponyms like 'beagle') and use that frequency signal as a proxy for entailment. The tell is adversarial cases: when the frequency pattern points one way but the actual entailment label points the other, fine-tuned models do worse than they did before fine-tuning. Training didn't correct the shortcut; it entrenched it (see "Does fine-tuning on NLI teach inference or amplify shortcuts?").
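
The adversarial check described above can be sketched in a few lines. This is a minimal illustration, not the method of the cited work: the word counts, the example pairs, and every name here (`NLIExample`, `TOY_FREQ`, `split_by_frequency_agreement`, `shortcut_model`) are hypothetical placeholders. It splits examples by whether a frequency heuristic agrees with the gold label, then scores a predictor on each subset separately.

```python
from dataclasses import dataclass

# Hypothetical corpus counts, invented for illustration only.
TOY_FREQ = {"animal": 1000, "dog": 500, "vehicle": 300,
            "canine": 50, "sedan": 20, "beagle": 10}

@dataclass
class NLIExample:
    premise_word: str      # key word in the premise
    hypothesis_word: str   # key word in the hypothesis
    label: bool            # True if the premise entails the hypothesis

def frequency_predicts_entailment(ex, freq):
    """Frequency shortcut: guess 'entails' when the hypothesis word is more common."""
    return freq[ex.hypothesis_word] > freq[ex.premise_word]

def split_by_frequency_agreement(examples, freq):
    """Separate examples where the shortcut matches the label from those where it does not."""
    consistent, adversarial = [], []
    for ex in examples:
        bucket = consistent if frequency_predicts_entailment(ex, freq) == ex.label else adversarial
        bucket.append(ex)
    return consistent, adversarial

def accuracy(examples, predict):
    """Fraction of examples where predict(ex) equals the gold label."""
    if not examples:
        return float("nan")
    return sum(predict(ex) == ex.label for ex in examples) / len(examples)

EXAMPLES = [
    NLIExample("beagle", "animal", True),    # shortcut agrees with label
    NLIExample("animal", "beagle", False),   # shortcut agrees with label
    NLIExample("sedan", "vehicle", True),    # shortcut agrees with label
    NLIExample("dog", "beagle", False),      # shortcut agrees with label
    NLIExample("beagle", "vehicle", False),  # adversarial: common hypothesis word, no entailment
    NLIExample("dog", "canine", True),       # adversarial: rarer hypothesis word, entailment holds
]

# Stand-in for a model that relies purely on the frequency shortcut.
def shortcut_model(ex):
    return frequency_predicts_entailment(ex, TOY_FREQ)

consistent, adversarial = split_by_frequency_agreement(EXAMPLES, TOY_FREQ)
print("frequency-consistent accuracy:", accuracy(consistent, shortcut_model))
print("adversarial accuracy:", accuracy(adversarial, shortcut_model))
```

In practice `shortcut_model` would be replaced by a real model's predictions, evaluated once before and once after fine-tuning. A widening gap between the two subsets is the signature of a more deeply learned shortcut.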

Why would more training make a bias stronger instead of weaker? A complementary line in the corpus suggests the bias isn't really created by fine-tuning at all: it's planted during pretraining and only nudged afterward. A causal study using random-seed variation and cross-tuning found that models sharing a pretrained backbone show similar bias patterns regardless of the data they are fine-tuned on; fine-tuning modulates the bias but doesn't author it (see "Where do cognitive biases in language models come from?"). Read together, the two notes tell one story: the frequency prior lives deep in the weights, and a task like NLI, where frequency happens to correlate with the right answer most of the time, gives the model a reason to rely on it even more.
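
One way to picture the cross-tuning comparison is to treat each model's bias profile as a vector and ask which pairs look more alike: models sharing a backbone, or models sharing fine-tuning data. The sketch below is an assumption-laden illustration, not the study's actual analysis. The backbone names, dataset names, bias scores, and the `mean_similarity` helper are all made up for the example.

```python
from itertools import combinations
from math import sqrt

# Hypothetical 2x2 cross-tuning grid; the bias scores are placeholder values.
MODELS = [
    {"backbone": "A", "data": "x", "bias": [0.8, 0.1, 0.5]},
    {"backbone": "A", "data": "y", "bias": [0.7, 0.2, 0.5]},
    {"backbone": "B", "data": "x", "bias": [0.1, 0.9, 0.2]},
    {"backbone": "B", "data": "y", "bias": [0.2, 0.8, 0.1]},
]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mean_similarity(models, key):
    """Average similarity over all model pairs that share the given attribute."""
    sims = [cosine(a["bias"], b["bias"])
            for a, b in combinations(models, 2)
            if a[key] == b[key]]
    return sum(sims) / len(sims)

same_backbone = mean_similarity(MODELS, "backbone")
same_data = mean_similarity(MODELS, "data")
print(f"shared backbone: {same_backbone:.3f}, shared fine-tuning data: {same_data:.3f}")
```

If bias is planted in pretraining, the shared-backbone similarity should come out clearly higher than the shared-data similarity, which is what these placeholder numbers were chosen to show.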

This is an instance of a broader pattern worth seeing: strong training-time priors override what's actually in front of the model. Language models routinely ignore information in their context when parametric knowledge from training is confident, and plain prompting can't talk them out of it; you need to intervene in the representations themselves (see "Why do language models ignore information in their context?"). NLI frequency bias is the same dynamic at the level of a single inference: the corpus-level prior outshouts the semantic relationship the task is supposed to test.

The amplification effect isn't unique to NLI or to supervised fine-tuning. Reinforcement learning shows a structurally similar move: RL post-training latches onto one dominant format already present in the pretraining distribution and suppresses the alternatives, often within the first epoch, and the winning format is not necessarily the best-performing one (see "Does RL training collapse format diversity in pretrained models?"). Different objective, same gravitational pull toward what the base model already does. These shortcuts survive because surface statistics genuinely capture a lot: models that ace easy NLI nonetheless fail systematically on deeper structure, misreading embedded clauses and complex grammar as syntactic depth increases. That is evidence that statistical pattern-matching and real grammatical understanding are different things wearing the same score (see "Why do large language models fail at complex linguistic tasks?").

The thing you didn't know you wanted to know: a benchmark improvement after fine-tuning can mean the model learned the *task* or learned a *correlate* of the task, and the two look identical until you build adversarial cases that pry them apart. NLI is a clean place to catch the difference — but the lesson generalizes to almost any fine-tuning result you're tempted to trust.


Sources (5 notes)

Does fine-tuning on NLI teach inference or amplify shortcuts?

NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about fine-tuning, frequency bias, and NLI in current large language models. The question remains: Does fine-tuning on NLI tasks reduce or amplify frequency bias?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and include:
• Fine-tuning on NLI amplifies frequency bias rather than correcting it; models learn to rely harder on word prevalence (hypernyms over hyponyms) as an entailment proxy (~2025, arXiv:2505.21011).
• Frequency priors are planted during pretraining, not created by fine-tuning; fine-tuning modulates but does not author the bias (~2025, arXiv:2507.07186).
• LLMs systematically fail on embedded clauses and complex grammar, worsening predictably with syntactic depth — evidence that statistical pattern-matching differs from genuine grammatical understanding (~2025, arXiv:2503.19260).
• RL post-training converges on a single dominant pretraining distribution format within one epoch, suppressing alternatives by prevalence rather than performance (~2025, arXiv:2504.07912).
• Consistency training and representation-level interventions can suppress shortcuts like sycophancy; adversarial cases still expose frequency-driven failures (~2025, arXiv:2510.27062).

Anchor papers (verify; mind their dates):
• arXiv:2505.21011 (2025-05): LLMs are Frequency Pattern Learners in Natural Language Inference
• arXiv:2507.07186 (2025-07): Planted in Pretraining, Swayed by Finetuning
• arXiv:2504.07912 (2025-04): Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2503.19260 (2025-03): Linguistic Blind Spots of Large Language Models

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, assess whether newer models (GPT-4o, o1, Claude 3.5+), intervention methods (consistency training, representation surgery, multi-step prompting), or evaluation protocols have since relaxed or overturned the frequency-bias amplification effect. Separate the durable question (does fine-tuning on NLI teach reasoning or exploit correlation?) from the perishable limitation (are current models still trapped by pretraining priors?). Cite what architecture, training, or evaluation change resolved it; state plainly where the constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months showing fine-tuning *does* reduce bias, or that newer methods (e.g., constitutional AI, curriculum-aware fine-tuning, mechanistic interventions) genuinely decouple inference from frequency.
(3) Propose 2 research questions that assume the regime may have moved: (a) Do post-training methods that explicitly penalize surface-statistic shortcuts (e.g., adversarial fine-tuning) now measurably reduce frequency bias on out-of-distribution NLI? (b) Can probing or intervention at the representation level during fine-tuning decouple frequency priors from task performance without sacrificing benchmark gains?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
