Can log-likelihood loss combined with binary rewards achieve calibration?
This explores whether the standard training recipe — log-likelihood loss plus simple right/wrong (binary) rewards — can produce a model that *knows what it knows*, i.e. whose stated confidence matches its actual accuracy.
The corpus's sharpest answer is no, not on its own. Binary rewards have a built-in flaw: they only ask 'was the final answer right?', so they never punish a model for being confidently wrong. Training therefore actively pushes models toward high-confidence guessing, because a lucky confident guess pays the same as a careful one. The note "Does binary reward training hurt model calibration?" shows this degradation is provable, not incidental, and that the fix is to add a *proper scoring rule* (the Brier score) as a second reward term, which mathematically lets accuracy and calibration improve together rather than trading off.
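To make the Brier-score fix concrete, here is a minimal sketch of a reward that adds a Brier penalty to the binary correctness term. The function names and the assumption that the model emits a scalar confidence in [0, 1] alongside its answer are illustrative, not taken from the note.

```python
def binary_reward(correct: bool) -> float:
    """Plain right/wrong reward: blind to how confident the model was."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus the Brier penalty (confidence - outcome)^2.

    Assumes `confidence` is the model's stated probability of being right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))
```

Under the binary reward a wrong answer scores 0.0 whatever the confidence, while under the combined reward a wrong answer stated at 0.9 scores about -0.81 and one stated at 0.1 scores about -0.01. Because the Brier score is a proper scoring rule, the expected reward is maximized by reporting the true probability of being correct, which is what `best_confidence` recovers numerically.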
The deeper point is that the reward signal, not the likelihood loss, is doing the damage. A loss function inherits whatever objective you point it at. "Can utility-weighted training loss actually harm model performance?" makes this vivid from the opposite direction: when you bend the loss toward a decision-oriented (utility-weighted) objective, you sharpen *choosing* but weaken the model's underlying representation, because you've starved the gradients that teach it real structure. The lesson cuts both ways: coarse, outcome-only signals (a binary reward, an asymmetric loss) optimize the thing they measure and quietly corrupt the things they don't, with calibration the usual casualty.
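The source summary for this note reports that training with a symmetric loss and then adjusting predictions post-hoc beats utility-weighted training. One standard way to do such a post-hoc adjustment is a cost-sensitive decision threshold applied to a calibrated probability. The sketch below assumes a binary decision and hypothetical costs `cost_fp` and `cost_fn`; it illustrates the general idea, not the note's specific procedure.

```python
def expected_cost(p_positive: float, act: bool,
                  cost_fp: float, cost_fn: float) -> float:
    """Expected cost of acting (or not) given a calibrated probability."""
    if act:
        # Acting is only wrong when the true label is negative.
        return (1.0 - p_positive) * cost_fp
    # Not acting is only wrong when the true label is positive.
    return p_positive * cost_fn


def decision_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which acting has the lower expected cost."""
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def decide(p_positive: float, cost_fp: float, cost_fn: float) -> bool:
    """Apply the utility asymmetry at decision time, not in the loss."""
    return p_positive >= decision_threshold(cost_fp, cost_fn)
```

With a miss nine times as costly as a false alarm, the threshold drops to 0.1, so a prediction of 0.2 triggers action even though the model itself was trained without any asymmetry. The probabilities stay honest and only the decision rule carries the utility.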
There's a more elegant route than bolting on a penalty: make the model's own confidence part of the reward. "Can model confidence work as a reward signal for reasoning?" (RLSF) ranks reasoning traces by the model's answer-span confidence, building synthetic preferences that both strengthen step-by-step reasoning *and* reverse the calibration damage that ordinary RLHF inflicts, without human labels or external verifiers. So calibration isn't a tax you pay against capability; with the right signal the two move in the same direction.
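The ranking step can be sketched as follows. This is a rough illustration only: it assumes confidence is the geometric mean of token probabilities over the answer span, and the `Trace` type, `min_gap` parameter, and pairing rule are hypothetical choices rather than the RLSF recipe itself.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Trace:
    text: str
    # Per-token log-probabilities over the final answer span (assumed input).
    answer_logprobs: tuple[float, ...]


def answer_confidence(trace: Trace) -> float:
    """Geometric mean of the answer-span token probabilities."""
    if not trace.answer_logprobs:
        raise ValueError("trace has no answer tokens")
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def build_preferences(traces: list[Trace],
                      min_gap: float = 0.05) -> list[tuple[Trace, Trace]]:
    """Return (chosen, rejected) pairs ordered by answer-span confidence.

    Pairs whose confidence gap is below `min_gap` are skipped as too noisy.
    """
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    pairs = []
    for higher, lower in combinations(ranked, 2):
        if answer_confidence(higher) - answer_confidence(lower) >= min_gap:
            pairs.append((higher, lower))
    return pairs
```

The resulting pairs could then feed any preference-based trainer. No human label or external verifier appears anywhere in the pipeline, which is the property the note highlights.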
Worth knowing as you go deeper: the *shape* of the reward matters as much as its content. "Does negative reinforcement alone outperform full reinforcement learning?" finds that training only on wrong answers (suppressing them) preserves output diversity, whereas positive-only reinforcement concentrates probability mass and degrades Pass@k at higher k. That preserved spread is exactly what honest uncertainty looks like. Meanwhile a whole line of work argues that richer reward signals beat thin binary ones in general: "Can judges that reason about reasoning outperform classifier rewards?" and "Can reward models benefit from reasoning before scoring?" show that rewards which *reason* about the work, rather than just stamp it correct/incorrect, raise the ceiling on what the model can learn.
One caution before you trust any calibration number: "Does setting temperature to zero actually make LLM outputs reliable?" reminds us that a confident, repeatable output is still just one draw from a distribution. Consistency is not the same as being right, and a model can be reliably miscalibrated. So the honest answer to the question is: log-likelihood plus binary rewards *can't* get you calibration by themselves, but log-likelihood plus a reward built on a proper scoring rule, or a confidence-aware reward, can.
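For readers who want to see what a "calibration number" typically is, here is a minimal sketch of expected calibration error (ECE), a common binned measure of the gap between stated confidence and observed accuracy. The notes do not say which metric they use, so treating ECE as the metric is an assumption, as are the function name and the equal-width binning.

```python
def expected_calibration_error(confidences: list[float],
                               correct: list[bool],
                               n_bins: int = 10) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("inputs must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("need at least one prediction")

    bins: list[list[tuple[float, float]]] = [[] for _ in range(n_bins)]
    for conf, is_right in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if is_right else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that repeats the same wrong answer at 0.95 confidence ten times is perfectly consistent yet scores an ECE of about 0.95, which is the "reliably miscalibrated" case in miniature.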
Sources (7 notes)
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.