What trade-offs emerge between training objectives and model reliability?
This explores how the way you train a model — the objective you optimize for — quietly reshapes what it can be trusted to do, often degrading reliability in ways the training signal never penalized.
Training objectives and model reliability do trade off against each other, and the corpus is emphatic that the costs often stay hidden until you look for them. The cleanest version of the trade-off is that *what you reward is what you get, including the parts you didn't mean to reward.* Binary correctness rewards are the classic case: because a confident wrong answer costs the same as a hesitant one, the model learns to guess boldly, and calibration quietly collapses (see: Does binary reward training hurt model calibration?). The same logic explains why abstention fails asymmetrically: reasoning-trained models over-answer because abstaining is never rewarded, while safety-trained models over-refuse, so 'miscalibration' isn't one bug but a signature of whichever objective dominated (see: Does training objective determine which direction models fail at abstention?).
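The incentive described above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes abstaining scores exactly 0, which is one reading of "abstaining is never rewarded".

```python
def binary_reward(is_correct: bool, stated_confidence: float) -> float:
    """0/1 correctness reward. stated_confidence is deliberately unused:
    this reward cannot see how sure the model claimed to be."""
    return 1.0 if is_correct else 0.0


def expected_reward(p_correct: float, stated_confidence: float,
                    abstain: bool = False) -> float:
    """Expected reward for one question.

    Assumption (not from the source): abstaining earns exactly 0.
    """
    if abstain:
        return 0.0
    return (p_correct * binary_reward(True, stated_confidence)
            + (1.0 - p_correct) * binary_reward(False, stated_confidence))


# A model that is right only 10% of the time on this question:
bold_guess = expected_reward(0.10, stated_confidence=0.99)
honest_guess = expected_reward(0.10, stated_confidence=0.10)
skipped = expected_reward(0.10, stated_confidence=0.10, abstain=True)
```

Under these assumptions `bold_guess` and `honest_guess` are identical, and both beat `skipped`, so nothing in the signal favors hedging or abstaining, even on a question the model will usually get wrong.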
The unsettling part is how *invisible* these costs are. Train a model to be warm and personable and its error rate rises by 5–9 percentage points on medical reasoning, factual accuracy, and resisting disinformation, and standard safety benchmarks don't catch it at all (see: Does warmth training make language models less reliable?). A separate thread shows that even determinism is a false comfort: pinning temperature to zero gives you the same answer every time, but it's still one draw from the distribution, so consistency masquerades as reliability (see: Does setting temperature to zero actually make LLM outputs reliable?). The reliability you think you bought and the reliability you actually have keep diverging.
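The temperature-zero point can be illustrated with a toy distribution. This is a simplified sketch, not a model of real decoding: it treats a whole answer as a single draw, and the probabilities in `ANSWER_PROBS` are made up for illustration.

```python
import random
from collections import Counter

# Hypothetical distribution over three candidate answers (invented numbers).
ANSWER_PROBS = {"A": 0.40, "B": 0.35, "C": 0.25}


def greedy_answer(probs: dict) -> str:
    """Temperature-zero analogue: always return the single most likely answer."""
    return max(probs, key=probs.get)


def sampled_answers(probs: dict, n: int, seed: int = 0) -> list:
    """Draw n answers in proportion to their probabilities."""
    rng = random.Random(seed)
    answers = list(probs)
    weights = [probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n)


# 100 deterministic runs agree with each other perfectly...
greedy_runs = [greedy_answer(ANSWER_PROBS) for _ in range(100)]
consistency = Counter(greedy_runs).most_common(1)[0][1] / len(greedy_runs)

# ...but the model itself only puts this much mass on that answer.
support = ANSWER_PROBS[greedy_runs[0]]

# Sampling exposes the disagreement that greedy decoding hides.
samples = sampled_answers(ANSWER_PROBS, n=1000)
share_of_top = samples.count(greedy_runs[0]) / len(samples)
```

Here `consistency` is 1.0 while `support` is only 0.40: repeated identical outputs say nothing about how much of the distribution stands behind them.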
Reinforcement learning sharpens the trade-off into something almost mechanical. RL doesn't just teach skills: it collapses the model's diversity, converging on a single dominant pretraining format within the first epoch and suppressing the alternatives, where the 'winner' tracks model scale rather than quality (see: Does RL training collapse format diversity in pretrained models?). Push too hard on impossible problems and the model learns degenerate shortcuts, such as answer repetition and computation-skipping, that then *contaminate* capabilities it already had (see: Do overly hard RLVR samples actually harm model capabilities?). And drifting too far from the base model burns plasticity: staying close in KL terms preserves the ability to keep learning later, while parameter-only RL stalls when the domain shifts (see: Does staying close to the base model preserve learning ability?). There's even a learning-vs-deciding split: utility-weighted loss makes a model decide better but learn worse, because the asymmetric signal starves feature acquisition (see: Can utility-weighted training loss actually harm model performance?).
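The source note on overly hard samples attributes the damage to group-relative normalization, which treats a rare accidental success as a high-advantage trajectory. A minimal sketch of that arithmetic follows. It assumes the common "subtract the group mean, divide by the group standard deviation" form with a population standard deviation; real implementations may differ in the variance estimator and in the `eps` constant.

```python
import math


def group_relative_advantages(rewards: list, eps: float = 1e-8) -> list:
    """Normalize each reward against its own sampling group.

    Assumption: population standard deviation (divide by n), plus a small
    eps so an all-equal group yields zero advantages instead of dividing by 0.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# One lucky success among 8 attempts at a nearly impossible problem.
small_group = group_relative_advantages([1.0] + [0.0] * 7)

# The same single success among 64 attempts.
large_group = group_relative_advantages([1.0] + [0.0] * 63)

# A group with no successes carries no learning signal at all.
all_failed = group_relative_advantages([0.0] * 8)
```

With a single success in a group of n, the winning trajectory's advantage works out to about sqrt(n - 1): roughly 2.65 for n = 8 and 7.94 for n = 64. The rarer the success, the harder that one trajectory is pushed, whether or not its reasoning was sound.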
What's quietly hopeful is that several of these trade-offs turn out to be *design choices, not laws.* The calibration collapse from binary rewards vanishes if you add a Brier-score term: accuracy and calibration can be jointly optimized with no trade-off once the objective stops ignoring confidence (see: Does binary reward training hurt model calibration?). The entropy damage from RL depends on *order*: train structured tasks before creative ones and you avoid entropy collapse wrecking open-ended ability, for a 6.2% gain over naive joint training (see: Does training order reshape how models handle different task types?). Filtering matters more than scale: keep only clean positive trajectories but preserve diverse failures as negative signal, and a 14B model reaches frontier math performance (see: Why do correct code trajectories teach models to tolerate errors?). Even messy data helps when matched to the learner: training on full exploration paths including failures builds more robust reasoning than clean shortcuts (see: Can models learn better by training on messy exploration paths?), while teacher refinements beyond a student's frontier actively hurt unless the student filters for what it can absorb (see: Does teacher-refined data always improve student model performance?).
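To see why a Brier-score term removes the incentive to overclaim, consider one simple form of the combined reward: correctness minus the squared gap between stated confidence and correctness. The exact formulation in the cited note may differ; the function names and the grid search below are illustrative assumptions.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumed form: r = y - (confidence - y)^2, with y = 1 if correct else 0.
    """
    y = 1.0 if is_correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda c: expected_reward(p_correct, c))


# Confident wrong answers now cost more than hesitant wrong answers.
confident_wrong = calibrated_reward(False, 0.99)
hesitant_wrong = calibrated_reward(False, 0.10)

# The reward-maximizing confidence matches the true chance of being right.
honest_report = best_confidence(0.30)
```

Setting the derivative of the expected reward to zero gives confidence = p_correct, so the best stated confidence is the honest one, while being right more often still raises the reward. That is the sense in which the two goals stop competing under this form of objective.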
The through-line the reader might not expect: reliability is rarely lost on the axis you were optimizing. It leaks out sideways — calibration, plasticity, format diversity, factual accuracy — into dimensions the objective never scored. The fix is almost never a bigger model; it's a richer objective, a smarter ordering, or better filtering that puts a price on the failure mode you forgot to penalize.
Sources (12 notes)
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Reasoning-trained models under-abstain and over-answer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.
Five models trained for warmth showed 5–9pp error increases on medical reasoning, factual accuracy, and disinformation resistance. Emotional context amplified errors by 19.4%, and standard safety benchmarks failed to detect the degradation.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and to surpass much larger models while producing cleaner reasoning.
Research shows that training on messy trajectories—failed attempts, self-correction, and backtracking—teaches more robust reasoning than training only on shortcut solutions. This approach models o1-style deep reasoning as search internalization rather than solution memorization.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.