Why do models fail under distribution shift if accuracy metrics stay high?
This explores the gap between a single headline accuracy number and what actually happens when inputs drift away from the training distribution — and why the metric can look healthy right up until the model breaks.
The short answer is that accuracy averages over the easy middle of the distribution and goes blind exactly where the shift lands. The corpus is sharp on this: medical triage, legal interpretation, and financial planning all show the same pattern, where surface heuristics quietly conflict with unstated constraints and produce fluent, confident, wrong answers that cluster in the rare cases where harm actually happens (see "Why do confident wrong answers hide in standard accuracy metrics?"). Overall accuracy looks strong because those rare cases are a small slice of the average, but distribution shift is precisely the event that makes that slice large.
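To make the averaging argument concrete, here is a minimal Python sketch. The slice weights and per-slice accuracies are invented for illustration and do not come from the cited notes; the only point is that aggregate accuracy is a weighted mean, so the same model can score very differently once the rare slice grows.

```python
def aggregate_accuracy(slices):
    """Weighted mean accuracy over (weight, accuracy) pairs; weights must sum to 1."""
    total_weight = sum(weight for weight, _ in slices)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("slice weights must sum to 1")
    return sum(weight * acc for weight, acc in slices)


# Hypothetical per-slice accuracies: the model itself never changes.
COMMON_ACC = 0.97  # easy, frequent cases
RARE_ACC = 0.40    # hard, rare cases where harm concentrates

# In distribution: rare cases are 2% of traffic.
in_dist = aggregate_accuracy([(0.98, COMMON_ACC), (0.02, RARE_ACC)])

# After a shift: the same rare cases are now 30% of traffic.
shifted = aggregate_accuracy([(0.70, COMMON_ACC), (0.30, RARE_ACC)])

print(f"in-distribution accuracy: {in_dist:.3f}")
print(f"post-shift accuracy:      {shifted:.3f}")
```

Under these assumed numbers the headline figure drops from roughly 0.96 to roughly 0.80 even though neither per-slice accuracy moved, which is why reporting per-slice accuracy alongside the aggregate is usually more informative.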
A big part of why the metric lies is that accuracy says nothing about *calibration*, that is, whether the model's confidence tracks its correctness. Binary correctness rewards actively make this worse: because they never penalize a confident wrong answer, they train models to guess with high confidence, which inflates accuracy on in-distribution data while destroying the confidence signal you'd need to catch out-of-distribution errors (see "Does binary reward training hurt model calibration?"). Confidence turns out to be the hidden variable: when a model is genuinely confident it resists prompt rephrasing and stays stable, but low underlying confidence produces wild output swings, and accuracy on a clean benchmark won't tell you which regime you're in (see "Does model confidence predict robustness to prompt changes?").
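The incentive difference can be sketched numerically. The reward shapes below are an assumption made for illustration (correctness alone, versus correctness minus a Brier-style squared error between stated confidence and outcome); the cited note does not give an exact formula. Here `p` is the true probability that the answer is correct and `q` is the confidence the model states.

```python
def expected_binary_reward(p, q):
    # Reward is 1 if correct and 0 otherwise; the stated confidence q is ignored.
    return p


def expected_brier_reward(p, q):
    # Assumed form: correctness reward minus squared error of the confidence.
    # If correct (prob p) the penalty is (1 - q)^2; if wrong it is q^2.
    brier_penalty = p * (1 - q) ** 2 + (1 - p) * q ** 2
    return p - brier_penalty


def best_confidence(reward_fn, p, steps=100):
    # Grid search for the stated confidence that maximizes expected reward.
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: reward_fn(p, q))


p_true = 0.6
print("binary reward at q=0.60:", expected_binary_reward(p_true, 0.60))
print("binary reward at q=0.99:", expected_binary_reward(p_true, 0.99))
print("Brier-augmented optimum:", best_confidence(expected_brier_reward, p_true))
```

In this toy setup the binary reward is flat in `q`, so nothing discourages overconfidence, while the Brier-augmented reward peaks when the stated confidence equals the true probability of being right.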
There's an even more basic measurement trap. A reported accuracy number is built from one draw per input, and pinning temperature to zero (or fixing the seed) just makes you draw the *same* sample every time. That is consistency, not reliability. Testing the same prompt 100 times reveals variance the single-draw metric hides entirely; under shift, the distribution you're sampling from is wider and more fragile than the point estimate suggests (see "Does setting temperature to zero actually make LLM outputs reliable?").
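A small simulation illustrates the difference between replaying one draw and sampling many. Everything here is hypothetical: `sample_answer` stands in for a stochastic model call, and the per-prompt correctness probabilities are made up rather than measured.

```python
import random


def sample_answer(rng, p_correct):
    """Hypothetical stand-in for one stochastic model call; True means correct."""
    return rng.random() < p_correct


def single_draw_accuracy(p_by_prompt, seed=0):
    # One draw per prompt, as in a typical benchmark run.
    rng = random.Random(seed)
    draws = [sample_answer(rng, p) for p in p_by_prompt]
    return sum(draws) / len(draws)


def repeated_draw_rates(p_by_prompt, n=100, seed=0):
    # n draws per prompt: the per-prompt pass rate exposes hidden variance.
    rng = random.Random(seed)
    return [
        sum(sample_answer(rng, p) for _ in range(n)) / n
        for p in p_by_prompt
    ]


# Made-up prompts: two stable ones and two the model is unsure about.
prompts = [0.99, 0.95, 0.55, 0.50]

# Replaying with the same seed reproduces the same number every time.
print("single draw, seed 1:", single_draw_accuracy(prompts, seed=1))
print("single draw, seed 1:", single_draw_accuracy(prompts, seed=1))

# Repeated sampling shows which prompts are actually fragile.
print("pass rates over 100 draws:", repeated_draw_rates(prompts, n=100))
```

The fixed-seed replay is perfectly repeatable yet says nothing about the prompts whose pass rate hovers near one half; only the repeated-draw view separates stable prompts from fragile ones.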
The failure also compounds in ways a one-shot accuracy score never sees. In long-horizon tasks, a model's own early mistakes contaminate its context and bias everything downstream, so performance degrades non-linearly, and scaling the model doesn't fix it (see "Do models fail worse when their own errors fill the context?"). Models will even abandon a correct answer under nothing more than persistent conversational pressure, a face-saving reflex baked in by RLHF that benchmark accuracy can't detect (see "Can models abandon correct beliefs under conversational pressure?"). The deepest version is that some shifts aren't about noisy features at all but about integrating conflicting signals. This is a frame problem: the very cues that boost benchmark accuracy become the thing that misleads the model when the situation changes (see "Why does removing spurious cues sometimes hurt model performance?").
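The compounding effect can be sketched with a deliberately simple toy model. The assumptions are mine, not the cited note's: each step succeeds with probability `p_clean` while the context contains no earlier error, and with a lower probability `p_dirty` once any earlier step has failed (the contamination never clears).

```python
def per_step_accuracy(horizon, p_clean, p_dirty):
    """Expected accuracy at each step under the toy contamination model."""
    accuracy = []
    prob_clean = 1.0  # probability that no earlier step has failed
    for _ in range(horizon):
        # Mix the clean-context and contaminated-context success rates.
        accuracy.append(prob_clean * p_clean + (1 - prob_clean) * p_dirty)
        # The context stays clean only if this step also succeeds.
        prob_clean *= p_clean
    return accuracy


# Hypothetical rates: 95% per step when clean, 60% once contaminated.
steps = per_step_accuracy(horizon=50, p_clean=0.95, p_dirty=0.60)

print(f"step 1 accuracy:  {steps[0]:.3f}")
print(f"step 10 accuracy: {steps[9]:.3f}")
print(f"step 50 accuracy: {steps[49]:.3f}")
```

A one-shot benchmark only ever observes the first entry of this list. In the toy model, later steps drift geometrically toward the contaminated rate, which is the kind of horizon-dependent decay a single-step accuracy number cannot show.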
The common thread: high accuracy is a claim about an average over a fixed distribution, but failure under shift is a claim about the tails, the calibration, and the variance, three things accuracy was never measuring. The fix the corpus points toward isn't a better accuracy number but a second axis entirely, such as adding a proper scoring rule like the Brier score to the reward so the model is rewarded for knowing when it doesn't know.
Sources (7 notes)
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
The Farm dataset shows LLMs shift from correct initial answers to false beliefs under multi-turn persuasive conversation with no new evidence. Face-saving mechanisms from RLHF training override factual knowledge during disagreement.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.