Why does NLI fine-tuning amplify frequency bias instead of teaching inference?
This explores why teaching a model the inference task (NLI = natural language inference: deciding whether one sentence entails another) ends up sharpening a counting shortcut — preferring whichever word is more common in the corpus — instead of installing genuine semantic reasoning.
The corpus has a surprisingly unified answer: fine-tuning doesn't write new reasoning into a model so much as it amplifies whatever the model already leaned on. The direct finding is that NLI fine-tuning makes models rely *more* on corpus-level frequency patterns: hypernyms ("animal") tend to appear more often than hyponyms ("dog"), and the model learns to ride that statistical gradient instead of checking actual entailment. The tell is adversarial cases: when frequency and the true label disagree, the fine-tuned model performs *worse*, which means the shortcut got reinforced, not corrected (Does fine-tuning on NLI teach inference or amplify shortcuts?).
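One way to make the adversarial diagnostic concrete is to split an evaluation set by whether a pure frequency heuristic agrees with the gold label, then report accuracy on each half. The sketch below is illustrative only: the word counts, the four example pairs, and the function names `frequency_heuristic` and `split_accuracy` are hypothetical and do not come from the cited notes.

```python
from collections import Counter

def frequency_heuristic(premise_word, hypothesis_word, counts):
    """Shortcut: predict entailment iff the hypothesis word is the more frequent one."""
    if counts[hypothesis_word] > counts[premise_word]:
        return "entailment"
    return "non-entailment"

def split_accuracy(examples, predict, counts):
    """Accuracy on frequency-consistent vs. adversarial examples.

    An example is 'adversarial' when the frequency heuristic disagrees
    with the gold label, and 'consistent' otherwise.
    """
    buckets = {"consistent": [0, 0], "adversarial": [0, 0]}  # [correct, total]
    for premise_word, hypothesis_word, gold in examples:
        heuristic = frequency_heuristic(premise_word, hypothesis_word, counts)
        bucket = "consistent" if heuristic == gold else "adversarial"
        prediction = predict(premise_word, hypothesis_word)
        buckets[bucket][0] += int(prediction == gold)
        buckets[bucket][1] += 1
    return {
        name: (correct / total if total else None)
        for name, (correct, total) in buckets.items()
    }

# Hypothetical corpus counts: the hypernym "animal" is common, "canine" is rare.
counts = Counter({"animal": 50, "dog": 20, "canine": 2})

# (premise word, hypothesis word, gold label)
examples = [
    ("dog", "animal", "entailment"),      # frequency agrees with the label
    ("animal", "dog", "non-entailment"),  # frequency agrees with the label
    ("dog", "canine", "entailment"),      # rarer hypernym: frequency disagrees
    ("canine", "dog", "non-entailment"),  # frequency disagrees
]

# Stand-in for a model that relies entirely on the shortcut.
def shortcut_model(premise_word, hypothesis_word):
    return frequency_heuristic(premise_word, hypothesis_word, counts)

result = split_accuracy(examples, shortcut_model, counts)
print(result)
```

A model that leans entirely on the shortcut is perfect on the consistent half and wrong on the adversarial half. Swapping a real classifier in for `shortcut_model` and comparing the two numbers before and after fine-tuning would show whether the gap widens.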
Why would gradient descent prefer the shortcut? Because the shortcut is already there before fine-tuning ever starts. A causal study varying random seeds and swapping tuning data found that models sharing a pretrained backbone keep a similar bias fingerprint no matter what they're fine-tuned on: biases are *planted* in pretraining and only *nudged* afterward (Where do cognitive biases in language models come from?). Fine-tuning operates on a model that has already decided frequency is a good predictor, so the cheapest way to lower training loss is to lean harder on that prior rather than build a new entailment-checking circuit.
This is part of a broader pattern where fine-tuning sharpens what exists instead of teaching procedures. RL-tuned models look like they reason, but their performance drops sharply on out-of-distribution variants of the same problem, revealing template-matching rather than an installed algorithm (Do fine-tuned language models actually learn optimization procedures?). And RL post-training tends to converge on a single dominant *format* already present in pretraining, suppressing alternatives within the first epoch. That, again, is amplification of a pre-existing distribution, not creation of new capability (Does RL training collapse format diversity in pretrained models?). NLI frequency bias is the same story told with statistics instead of formatting.
There's a deeper reason the shortcut is so sticky: these models reason semantically, not symbolically. When you decouple semantic content from the logical task (give the correct rule but strip the familiar word associations), performance collapses, because the model is manipulating token associations rather than formal relations (Do large language models reason symbolically or semantically?). Entailment *is* a formal relation, so a model built to follow associations will substitute the nearest associative proxy it has, and "which word is more common" is exactly such a proxy. The same dynamic shows up when strong training priors simply override contradicting in-context information: textual prompting can't dislodge the prior; you need causal intervention in the representations themselves (Why do language models ignore information in their context?).
The quietly useful takeaway: if biases live in pretraining and fine-tuning only modulates them, then fixing a learned shortcut with more task-specific fine-tuning is intervening at the wrong stage. The corpus points elsewhere, toward methods that change the *signal* rather than the data, like using model confidence as an intrinsic reward to rank reasoning traces (Can model confidence work as a reward signal for reasoning?), or supplying explicit negative examples that target the exact failure mode rather than hoping more positive examples crowd it out (Can small models match large models on function calling?). Frequency bias survives fine-tuning because fine-tuning was never the place it was born.
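To see why explicit negative examples change the training signal, it can help to look at the standard per-pair DPO objective, since DPO is the method named in the function-calling note. The sketch below is a minimal version under stated assumptions: the sequence log-probabilities are made-up numbers, `beta=0.1` is an arbitrary illustrative choice, and nothing here reproduces the setup of the cited work.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    The margin measures how much more the policy prefers the chosen
    response over the rejected one, relative to a frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical log-probabilities for one (chosen, rejected) pair.
# Policy identical to the reference: no preference has been learned yet.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-4.0, -7.0, -5.0, -5.0)

print(untrained, improved)
```

The rejected response enters the loss directly, so pushing down a specific failure mode lowers the loss even when the chosen response's probability stays fixed. Training on positive examples alone has no such term.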
Sources (8 notes)
NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.