Why does aggregate accuracy fail as a metric for rare harmful cases?
This explores why a single headline number — overall accuracy — hides exactly the failures that matter most: the rare, high-stakes cases where a confident wrong answer causes real harm.
The corpus has a direct answer: harmful errors don't spread evenly across the test set, they concentrate. In medical triage, legal interpretation, and financial planning, models produce fluent, confident, wrong answers precisely in the rare cases where surface heuristics conflict with an unstated constraint. Because those cases are rare, strong overall accuracy can sit on top of a pile of dangerous misses without flinching (see: Why do confident wrong answers hide in standard accuracy metrics?). Aggregate accuracy is an average that weights every case equally, so a handful of tail failures barely moves it. The tail is the whole point.
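The masking effect can be made concrete with a small sketch. The counts below are invented for illustration (they do not come from any of the notes), and the slice names `routine` and `rare_conflict` are hypothetical labels; the point is only that a per-slice breakdown exposes what the headline number averages away.

```python
from collections import defaultdict


def overall_accuracy(records):
    """Fraction of correct answers across all (slice_name, is_correct) records."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)


def accuracy_by_slice(records):
    """Accuracy computed separately for each slice name."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}


# Hypothetical test set: 980 routine cases and 20 rare cases where a surface
# heuristic conflicts with an unstated constraint.
records = (
    [("routine", True)] * 970
    + [("routine", False)] * 10
    + [("rare_conflict", True)] * 4
    + [("rare_conflict", False)] * 16
)

overall = overall_accuracy(records)      # 974 of 1000 correct
per_slice = accuracy_by_slice(records)   # routine: 970/980, rare_conflict: 4/20

print(f"overall accuracy: {overall:.3f}")
for name, acc in per_slice.items():
    print(f"{name}: {acc:.3f}")
```

With these made-up numbers the headline figure is 97.4% while the rare slice sits at 20%, and nothing in the aggregate hints at the gap. In practice the hard part is identifying the rare slice in the first place; this sketch assumes each case already carries a slice label.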
Why do the failures cluster there rather than scatter randomly? One note reframes the mechanism: these aren't 'distraction' errors where the model got confused by noise. They're composition failures: the model has to integrate conflicting signals, and it instead leans on a heuristic shortcut. Tellingly, removing the spurious cue makes things *worse*, the opposite of normal shortcut-learning behavior, because the real task was reconciling the conflict, not ignoring a distractor (see: Why does removing spurious cues sometimes hurt model performance?). So the rare harmful case isn't an unlucky draw; it's a structurally distinct kind of problem that an accuracy score can't distinguish from an easy one it also got right.
The deeper trap is that confidence doesn't rescue you. A model can be perfectly consistent and still be reliably wrong: fixing the seed or zeroing the temperature just replays the same single draw from its distribution, so 'it says the same thing every time' tells you nothing about whether that thing is correct (see: Does setting temperature to zero actually make LLM outputs reliable?). That's why approaches that catch rare harm look *past* both the aggregate score and the model's own confidence. One flags hallucination risk from pretraining-data statistics, namely entity combinations the model never saw, and fires even when the model is highly confident, because it targets the root cause rather than the symptom (see: Can pretraining data statistics detect hallucinations better than model confidence?). Another swaps coarse global confidence for step-level confidence, catching a reasoning breakdown mid-trace that a trace-wide average would smooth over (see: Does step-level confidence outperform global averaging for trace filtering?).
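The contrast between trace-wide and step-level confidence can be illustrated with a minimal sketch. This is not the method from the cited note: the sliding-window mean, the window size of 3, the 0.6 threshold, and the per-step confidence values are all assumptions chosen for illustration, and the function names are hypothetical.

```python
def global_confidence(step_confidences):
    """Trace-wide average confidence (the coarse signal)."""
    return sum(step_confidences) / len(step_confidences)


def window_means(step_confidences, window=3):
    """Mean confidence over each sliding window of consecutive steps."""
    last_start = len(step_confidences) - window
    return [
        sum(step_confidences[i:i + window]) / window
        for i in range(last_start + 1)
    ]


def lowest_window_confidence(step_confidences, window=3):
    """Worst local confidence anywhere in the trace."""
    means = window_means(step_confidences, window)
    # Fall back to the global mean when the trace is shorter than one window.
    return min(means) if means else global_confidence(step_confidences)


def first_breakdown_step(step_confidences, threshold, window=3):
    """Index of the step at which a window first drops below threshold, else None.

    Returning the window's last step models early stopping: the trace could be
    abandoned as soon as that step has been generated.
    """
    for start, mean in enumerate(window_means(step_confidences, window)):
        if mean < threshold:
            return start + window - 1
    return None


# Hypothetical per-step confidences with a dip in the middle of the trace.
trace = [0.95, 0.94, 0.96, 0.30, 0.25, 0.35, 0.95, 0.96, 0.94, 0.95]
threshold = 0.6  # arbitrary cutoff for this example

trace_mean = global_confidence(trace)
worst_local = lowest_window_confidence(trace)
stop_at = first_breakdown_step(trace, threshold)

print(f"global mean: {trace_mean:.3f}")
print(f"lowest window mean: {worst_local:.3f}")
print(f"first breakdown at step index: {stop_at}")
```

Under these assumed values the trace-wide mean clears the threshold, while the local signal flags the dip and would allow stopping partway through the trace. How per-step confidence is obtained (for example from token probabilities) is left open here.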
The unifying lesson is that *granularity is a safety property*. Every fix in this corpus replaces a single rolled-up number with a finer-grained signal: step-level instead of trace-level, data-side instead of confidence-side, and most strikingly, agentic evaluation that collects evidence per case and cuts 'judge shift' roughly a hundredfold relative to a single LLM grader (0.27% versus 31% on complex tasks). Even that system cascaded errors through a memory module, a reminder that fine-grained evaluators need error isolation of their own (see: Can agents evaluate AI outputs more reliably than language models?). Aggregate accuracy fails on rare harmful cases for the same reason a thermometer fails to find a tumor: it measures at the wrong resolution. If harm lives in the tail, you have to evaluate at the resolution of the tail.
Sources (6 notes)
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, which suggests that trace quality matters more than quantity.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.