Why do improvements in accuracy come at the cost of calibration?
This explores why training that pushes a model to get more answers right often makes its confidence less trustworthy. The corpus suggests the trade-off isn't accidental: it's baked into how we reward models. The clearest mechanism is the reward signal itself. When training scores only whether the final answer is correct, it never penalizes a confident wrong answer, so the model learns that high-confidence guessing is the optimal policy (see "Does binary reward training hurt model calibration?"). Accuracy and calibration come apart because the objective only ever measured one of them. Strikingly, the same note shows this is fixable: adding a proper scoring rule (the Brier score) as a second reward term makes the model optimize both at once, which means the trade-off is an artifact of an incomplete objective rather than a law of nature.
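The reward argument above can be made concrete with a toy calculation. This is a minimal sketch, not the cited note's implementation: the function names, the additive form `correctness - (confidence - correctness)^2`, and the grid search are all assumptions chosen for illustration.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Correctness-only reward: the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence (assumed form)."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(reward_fn, p_correct, q))


# Hypothetical model that is right 60% of the time.
flat_low = expected_reward(binary_reward, 0.6, 0.1)
flat_high = expected_reward(binary_reward, 0.6, 1.0)
honest = best_confidence(brier_augmented_reward, 0.6)
```

Under the binary reward, the expected reward is identical whether the model claims 10% or 100% confidence, so nothing discourages overclaiming. With the Brier term added, the expected reward in this sketch peaks when the stated confidence equals the true probability of being correct.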
The deeper pattern is that "accuracy" can rise even as the underlying reasoning rots. Supervised fine-tuning (SFT) lifts final-answer accuracy while cutting reasoning informativeness by 38.9% on average. The model reaches right answers through pattern-matching shortcuts rather than genuine inference, so it becomes more correct and less auditable at the same time (see "Does supervised fine-tuning actually improve reasoning quality?"). A model that's right for shallow reasons has no good internal basis for knowing when it's wrong, which is exactly what miscalibration looks like.
What makes this dangerous is that aggregate accuracy actively hides the cost. In medical triage, legal interpretation, and financial planning, fluent confident errors concentrate in the rare, high-harm cases, and overall accuracy looks great precisely because those failures are statistically swamped by easy correct cases (see "Why do confident wrong answers hide in standard accuracy metrics?"). So optimizing for the headline number can quietly worsen the thing you'd most want calibrated: the model's hesitation on the cases where it shouldn't be sure.
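A small worked example shows how the swamping happens. The counts below are hypothetical, invented purely for illustration, and the `routine` / `high_harm` stratum labels are assumptions rather than categories from the cited note.

```python
from collections import defaultdict


def accuracy_by_stratum(records):
    """Return (overall accuracy, per-stratum accuracy) for (stratum, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for stratum, is_correct in records:
        totals[stratum] += 1
        hits[stratum] += int(is_correct)
    overall = sum(hits.values()) / sum(totals.values())
    per_stratum = {s: hits[s] / totals[s] for s in totals}
    return overall, per_stratum


# Hypothetical evaluation set: 980 routine cases and 20 rare high-harm cases.
records = (
    [("routine", True)] * 975 + [("routine", False)] * 5
    + [("high_harm", True)] * 8 + [("high_harm", False)] * 12
)
overall, per_stratum = accuracy_by_stratum(records)
print(f"overall={overall:.3f}", {s: round(a, 3) for s, a in per_stratum.items()})
```

With these made-up counts the headline accuracy is 98.3%, while accuracy on the high-harm stratum is only 40%. Reporting the per-stratum numbers alongside the aggregate is one simple way to keep the rare failures visible.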
The corpus also reframes calibration as a *directional* failure, not a single dial. Reasoning-trained models under-abstain and over-answer because abstention earns no reward, while safety-trained models over-abstain and refuse benign questions: same broken calibration, opposite tilt, each inherited from whichever objective dominated training (see "Does training objective determine which direction models fail at abstention?"). This is the key to the whole question: calibration is a fingerprint of what you rewarded, so any accuracy-maximizing objective that ignores confidence will leave its own signature distortion.
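One way to see the opposite tilts is to ask when answering beats abstaining under a given reward. The sketch below is a toy decision model, not something taken from the cited note: it assumes a reward of 1 for a correct answer, a penalty `wrong_penalty` for a wrong one, and `abstain_reward` for declining, and it treats a "safety-style" objective as simply a large wrong-answer penalty.

```python
def answer_threshold(wrong_penalty: float, abstain_reward: float = 0.0) -> float:
    """Minimum probability of being correct at which answering beats abstaining.

    Answering pays p * 1 - (1 - p) * wrong_penalty in expectation, so it beats
    abstaining when p > (abstain_reward + wrong_penalty) / (1 + wrong_penalty).
    """
    return (abstain_reward + wrong_penalty) / (1.0 + wrong_penalty)


def should_answer(p_correct: float, wrong_penalty: float,
                  abstain_reward: float = 0.0) -> bool:
    """True when the expected reward of answering exceeds that of abstaining."""
    expected_answer = p_correct - (1.0 - p_correct) * wrong_penalty
    return expected_answer > abstain_reward


# Correctness-only objective: wrong answers cost nothing, abstaining earns nothing.
guess_anyway = should_answer(p_correct=0.05, wrong_penalty=0.0)

# Assumed safety-style objective: wrong answers are penalized heavily.
refuse_likely_right = not should_answer(p_correct=0.8, wrong_penalty=9.0)
```

With no penalty for wrong answers the threshold is 0, so answering is preferred at any nonzero chance of being right (under-abstention). With a penalty of 9 the threshold rises to 0.9, so the same policy declines a question it would get right 80% of the time (over-abstention).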
There's a quieter cousin worth knowing about. Asymmetric, utility-weighted losses correctly sharpen *decisions* but weaken representation learning, so training to act well can degrade what the model actually learns to represent. The fix is to learn with a symmetric loss, then adjust predictions afterward (see "Can utility-weighted training loss actually harm model performance?"). And once you're suspicious of confidence at all, note that even a model's apparent certainty is slippery: deterministic settings make outputs *consistent* without making them *reliable*. Repeating the same answer 100 times doesn't mean it's well-calibrated (see "Does setting temperature to zero actually make LLM outputs reliable?"). Taken together, the corpus's answer is that calibration is collateral: it suffers whenever the training target rewards being right without also rewarding knowing how right you are.
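The "symmetric loss first, adjust afterward" recipe can be sketched with a standard decision-theoretic threshold shift. This is one possible form of post-hoc adjustment, assumed here for illustration; the cited note may use a different one. The sketch assumes a binary classifier already trained with a symmetric loss that outputs a reasonably calibrated probability, and the costs `cost_fp` and `cost_fn` are hypothetical.

```python
def cost_sensitive_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability cutoff that minimizes expected cost for calibrated probabilities."""
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(prob_positive: float, predict_positive: bool,
                  cost_fp: float, cost_fn: float) -> float:
    """Expected cost of a decision given the probability that the true label is positive."""
    if predict_positive:
        return (1.0 - prob_positive) * cost_fp  # pays only if the label is negative
    return prob_positive * cost_fn              # pays only if the label is positive


def decide(prob_positive: float, cost_fp: float, cost_fn: float) -> bool:
    """Apply the utility weighting at decision time, leaving training untouched."""
    return prob_positive >= cost_sensitive_threshold(cost_fp, cost_fn)


# Hypothetical costs: a missed positive is nine times worse than a false alarm.
threshold = cost_sensitive_threshold(cost_fp=1.0, cost_fn=9.0)
flag_at_20_percent = decide(0.2, cost_fp=1.0, cost_fn=9.0)
```

Here the asymmetry lives entirely in the cutoff (0.1 rather than 0.5), so the model's probabilities and the features behind them are learned from an unweighted objective. The approach depends on those probabilities being well calibrated, which ties it back to the main question.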
Sources (6 notes)
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Medical triage, legal interpretation, and financial planning show a consistent pattern: surface heuristics conflict with unstated constraints, producing fluent confident errors that concentrate in rare cases where harm occurs. Aggregate accuracy masks these failures because overall performance looks strong.
Reasoning-trained models under-abstain and over-answer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals that calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.
Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.