INQUIRING LINE

Does fine-tuning on NLI tasks amplify or reduce frequency bias in language models?

This explores a sharp result about what happens when you fine-tune a language model on natural language inference (NLI) tasks — and whether that training teaches real reasoning or just sharpens a statistical shortcut.


The corpus has a direct answer, and it's the uncomfortable one: NLI fine-tuning *amplifies* frequency bias rather than reducing it (see "Does fine-tuning on NLI teach inference or amplify shortcuts?"). Because hypernyms (general words like 'animal') show up more often in text than hyponyms (specific words like 'spaniel'), models learn to lean on that frequency gap as a proxy for entailment. The tell is adversarial cases: when frequency points one way and the actual entailment label points the other, fine-tuned models perform *worse*, meaning the training didn't correct the shortcut, it carved it deeper.
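To make the shortcut concrete, here is a minimal sketch of a pure frequency heuristic: it predicts entailment whenever the hypothesis word is more frequent than the premise word. The toy corpus, the `FREQ` counter, and the `frequency_shortcut` function are all invented for illustration; none of them come from the cited work, and a real study would use counts from a large corpus.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
TOY_CORPUS = (
    "the animal ran . an animal slept . the animal ate . "
    "the dog barked . a dog ran . the spaniel barked ."
).split()

# Word counts: animal=3, dog=2, spaniel=1; unseen words count as 0.
FREQ = Counter(TOY_CORPUS)


def frequency_shortcut(premise_word: str, hypothesis_word: str) -> str:
    """Hypothetical shortcut: guess 'entailment' only if the hypothesis
    word is more frequent than the premise word. No semantics involved."""
    if FREQ[hypothesis_word] > FREQ[premise_word]:
        return "entailment"
    return "non-entailment"


# Frequency and meaning agree: 'spaniel' entails 'animal', and 'animal' is more frequent.
agreeing_guess = frequency_shortcut("spaniel", "animal")

# Frequency and meaning disagree: 'dog' entails 'canine', but 'canine' is
# rarer in this toy corpus, so the shortcut gets it wrong.
adversarial_guess = frequency_shortcut("dog", "canine")
```

On the agreeing pair the heuristic looks like it understands hypernymy; on the adversarial pair it returns "non-entailment" for a true entailment. That second kind of case is where a model leaning on frequency is exposed.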

What makes this more than a one-off finding is that the same frequency-tracking habit shows up across completely different tasks. Models systematically prefer higher-frequency surface phrasings over rare-but-equivalent paraphrases in math, translation, commonsense, and tool calling alike (see "Do language models really understand meaning or just surface frequency?"). So NLI fine-tuning isn't introducing a new flaw; it's pouring fuel on a mechanism that's already the model's default. The model is tracking statistical mass from pretraining and dressing it up as meaning-recognition.

That connects to a deeper question the corpus keeps circling: where do these biases actually live, and can fine-tuning move them? The evidence says biases are *planted in pretraining and only swayed, not removed, by fine-tuning* (see "Where do cognitive biases in language models come from?"). That reframes the NLI result. On this reading, fine-tuning didn't fail to teach inference because the recipe was wrong; it failed because the frequency prior was formed in pretraining, where fine-tuning has little reach. You're nudging a surface, not rewriting a foundation.

The pattern rhymes with other 'fine-tuning teaches the wrong thing' findings. RL fine-tuning, for instance, tends to sharpen memorized template-matching rather than installing a genuine reasoning procedure, and out-of-distribution variants expose the gap (see "Do fine-tuned language models actually learn optimization procedures?"). And the broader linguistic picture is that statistical learning captures surface patterns but stumbles on deep grammatical structure as complexity rises (see "Why do large language models fail at complex linguistic tasks?"). NLI sits right at that fault line: entailment is a structural, semantic relationship, but the model keeps reaching for the surface frequency signal because that's what's cheapest.

The thing worth walking away with: 'fine-tuning on task X' doesn't reliably teach the *skill* behind task X — it often just amplifies whatever shortcut already correlates with the right answer in your training data. If you want to know whether a model learned inference or learned frequency, you have to build adversarial cases where the two disagree. Without that test, an amplified shortcut looks exactly like improved reasoning on the scoreboard.
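One way to run that test is to split an evaluation set by whether the frequency cue agrees with the gold label, then report accuracy on each half separately. The sketch below assumes word-pair examples and a plain frequency table; `Example`, `split_by_frequency_agreement`, `accuracy`, and all counts and examples are hypothetical, made up for illustration rather than taken from any cited dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Example:
    premise_word: str
    hypothesis_word: str
    label: str  # "entailment" or "non-entailment"


def frequency_guess(ex: Example, freq: Dict[str, int]) -> str:
    """What the frequency cue alone would predict for this example."""
    more_frequent = freq.get(ex.hypothesis_word, 0) > freq.get(ex.premise_word, 0)
    return "entailment" if more_frequent else "non-entailment"


def split_by_frequency_agreement(
    examples: List[Example], freq: Dict[str, int]
) -> Tuple[List[Example], List[Example]]:
    """Return (consistent, adversarial): cue agrees vs. disagrees with the label."""
    consistent = [ex for ex in examples if frequency_guess(ex, freq) == ex.label]
    adversarial = [ex for ex in examples if frequency_guess(ex, freq) != ex.label]
    return consistent, adversarial


def accuracy(predict: Callable[[Example], str], examples: List[Example]) -> float:
    """Fraction of examples where the predictor matches the gold label."""
    if not examples:
        return float("nan")
    return sum(predict(ex) == ex.label for ex in examples) / len(examples)


# Invented counts and examples, for illustration only.
TOY_FREQ = {"animal": 300, "dog": 200, "spaniel": 10, "canine": 5}
TOY_EXAMPLES = [
    Example("spaniel", "animal", "entailment"),      # cue agrees
    Example("animal", "spaniel", "non-entailment"),  # cue agrees
    Example("dog", "canine", "entailment"),          # cue disagrees
    Example("canine", "dog", "non-entailment"),      # cue disagrees
]

consistent, adversarial = split_by_frequency_agreement(TOY_EXAMPLES, TOY_FREQ)

# Stand-in for a model that has learned only the shortcut.
shortcut_model = lambda ex: frequency_guess(ex, TOY_FREQ)
consistent_acc = accuracy(shortcut_model, consistent)
adversarial_acc = accuracy(shortcut_model, adversarial)
```

A predictor that relies purely on the cue scores perfectly on the consistent half and fails the adversarial half, so a single pooled accuracy number would hide the problem. In practice `shortcut_model` would be replaced by the fine-tuned model's own predictions.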


Sources (5 notes)

Does fine-tuning on NLI teach inference or amplify shortcuts?

NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher evaluating whether NLI fine-tuning truly teaches entailment reasoning or amplifies frequency bias—a question still open despite recent empirical work. The question remains live: does capability emerge from fine-tuning, or does it mask deepening shortcut reliance?

What a curated library found—and when (dated claims, not current truth):
Findings span 2024–2026 and converge on tension:
• NLI fine-tuning amplifies frequency bias rather than correcting it; models learn to treat hypernym prevalence as entailment proxy, failing adversarial cases where frequency and true labels diverge (~2025, arXiv:2505.21011).
• Frequency-preference generalizes across tasks (math, translation, commonsense, tool-calling); not NLI-specific (~2025, implied in path).
• Cognitive biases, including frequency tracking, are *planted in pretraining* and only surface-swayed by fine-tuning, not removed (~2025, arXiv:2507.07186).
• RL fine-tuning sharpens template-matching memorization rather than installing genuine reasoning; out-of-distribution tests expose the gap (~2025, arXiv:2504.07912).
• Linguistic blind spots on structural complexity worsen predictably as syntactic depth rises; fine-tuning does not repair underlying competence (~2025, arXiv:2503.19260).

Anchor papers (verify; mind their dates):
• arXiv:2505.21011 (2025): "LLMs are Frequency Pattern Learners in Natural Language Inference"
• arXiv:2507.07186 (2025): "Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Bias"
• arXiv:2504.07912 (2025): "Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining"
• arXiv:2503.19260 (2025): "Linguistic Blind Spots of Large Language Models"

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above—frequency amplification, pretraining primacy, RL deepening shortcuts, structural blindness—assess whether newer scaling (larger models, longer contexts, novel training procedures like Constitutional AI, chain-of-thought scaffolding, or layer-wise intervention) has relaxed or overturned it. Separate the durable question (what fine-tuning can and cannot do) from perishable limitations (e.g., can adversarial NLI datasets now be engineered to force genuine reasoning?). Cite what resolved each constraint; flag where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper argue fine-tuning *does* rewire core biases, or that frequency-learning and genuine reasoning co-emerge?
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., can multi-task fine-tuning on intentionally frequency-conflicting data reduce bias? Do newer architectures (SSMs, mixture-of-experts) distribute frequency shortcuts differently?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
