Can utility-weighted training loss actually harm model performance?
When engineers weight loss functions to reflect the real-world costs of different errors, does this improve or undermine learning? This note explores whether baking asymmetric objectives into training creates unintended side effects.
"Misaligned by Design" identifies a failure in what the authors call the Aligned Learning Premise (ALP): the intuition that using the human's utility function to train a model produces better performance in terms of that objective. In high-stakes settings where false positives and false negatives have asymmetric costs (e.g., medical diagnosis), engineers routinely bake these asymmetric weights into the training loss. This paper shows this can backfire.
The key insight: machine classifiers perform not one but two incentivized tasks.
- Choosing how to classify: given learned features, assign a label. Here asymmetric weighting works correctly.
- Learning how to classify: acquire informative feature representations through gradient descent. Here asymmetric weighting can weaken the learning signal.
Because the loss function shapes the gradient, it necessarily shapes the incentives for learning. Making the loss asymmetric can reduce the payoff to "substantive learning", so the model learns less informative representations.
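To make "the loss function shapes the gradient" concrete, here is a minimal sketch using a weighted binary cross-entropy on a single logit. It is an illustration only, not the paper's model: the function names and the weights `w_pos` and `w_neg` are hypothetical, and the sketch shows just the simplest effect, namely that a class weight rescales the gradient every example of that class sends back to the features.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_bce_grad(z: float, y: int, w_pos: float = 1.0, w_neg: float = 1.0) -> float:
    """Gradient of weighted binary cross-entropy with respect to the logit z.

    Loss:  -[w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p)],  p = sigmoid(z)
    d/dz:   w_pos * y * (p - 1) + w_neg * (1 - y) * p
    """
    p = sigmoid(z)
    return w_pos * y * (p - 1.0) + w_neg * (1 - y) * p


# Same uncertain prediction (z = 0, p = 0.5) on a negative example,
# first with symmetric weights, then with negatives down-weighted (assumed value).
symmetric = weighted_bce_grad(0.0, y=0)
down_weighted = weighted_bce_grad(0.0, y=0, w_neg=0.1)
print(symmetric, down_weighted)
```

Under these assumed weights the down-weighted example contributes a gradient ten times smaller, so whatever it could teach the representation arrives with a correspondingly weaker signal. How this plays out in a full network is an empirical question that the sketch does not settle.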
In both focal applications, training with a standard symmetric loss function and then adjusting predictions ex-post according to the human's utility function outperforms training with the utility-weighted loss directly, even when evaluated by the utility-weighted objective itself. In these applications, baking utility weights into training makes predictions worse.
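One standard way to "adjust ex-post" is to keep the symmetrically trained model's probabilities and move only the decision threshold. The sketch below assumes calibrated probabilities and zero cost for correct decisions. The names `cost_fp` and `cost_fn` are hypothetical, and the paper's own adjustment procedure may differ.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability threshold that minimizes expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation;
    predicting negative costs p * cost_fn.
    Positive is cheaper when p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(prob_positive: float, cost_fp: float, cost_fn: float) -> int:
    """Apply the cost-sensitive threshold to a model's predicted probability."""
    return int(prob_positive > cost_threshold(cost_fp, cost_fn))


# A missed positive is assumed to be 9x as costly as a false alarm.
print(cost_threshold(cost_fp=1.0, cost_fn=9.0))
print(decide(0.2, cost_fp=1.0, cost_fn=9.0))  # flagged under asymmetric costs
print(decide(0.2, cost_fp=1.0, cost_fn=1.0))  # not flagged under symmetric costs
```

The design point is the separation: the utility function enters only at decision time, so the training loss, and therefore the gradient that drives representation learning, stays symmetric.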
This resonates with findings across the LLM training literature. The note "Do reward models actually consider what the prompt asks?" shows that reward models meant to evaluate answer quality actually ignore the question, an incentive misalignment between what the loss teaches and what the evaluation requires. The note "Does supervised fine-tuning actually improve reasoning quality?" shows that SFT optimized for accuracy inadvertently degrades reasoning quality: the loss correctly incentivizes choosing the right answer but weakens the incentive to learn informative reasoning paths.
The general principle: when a training objective conflates two functions (learning representations and making decisions), optimizing one can degrade the other. Separating them (learn first, then decide) may be superior even though it seems less elegant. The finding in "Does binary reward training hurt model calibration?" is a direct instance: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence, and the Brier score fix explicitly separates these two objectives within the reward function.
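The exact form of the Brier score fix is not given in this note, so the following is a sketch under an assumed form: a correctness term plus a Brier penalty on the model's stated confidence. All function names are hypothetical. It shows why such a reward incentivizes calibration where a purely binary reward does not: the expected reward is maximized when stated confidence equals the true probability of being correct.

```python
def binary_reward(correct: int) -> float:
    """Correctness-only reward: it does not depend on stated confidence."""
    return float(correct)


def brier_augmented_reward(correct: int, confidence: float) -> float:
    """Assumed form: correctness minus squared error of the stated confidence."""
    return float(correct) - (confidence - correct) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected Brier-augmented reward when the answer is right with prob p_correct."""
    return (p_correct * brier_augmented_reward(1, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(0, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid search for the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))


# For a model that is right 70% of the time, the best stated confidence
# on the grid coincides with 0.7.
print(best_confidence(0.7))
```

The correctness term rewards choosing, while the penalty term rewards honest confidence, which mirrors the "learn first, then decide" separation. Setting the derivative of the expected reward with respect to the confidence to zero gives the same result analytically: the optimum sits at the true probability of being correct.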
Inquiring lines that use this note as a source (22)
This note is a source for these synthesized inquiries.
- Why do negative item weights matter more than model depth?
- What happens when a single loss function conflates representation learning with decision-making?
- Can log-likelihood loss combined with binary rewards achieve calibration?
- How do different training objectives shift whether models over-predict or under-predict?
- How does choosing fatigue affect which ranking positions matter most to users?
- Why do improvements in accuracy come at the cost of calibration?
- What makes top-N ranking loss difficult to optimize directly?
- How do loss functions simultaneously shape both learning and decision quality?
- What makes utility-weighted training backfire in machine learning systems?
- How do reward model biases cascade into downstream optimization failures?
- Can alignment methods model loss aversion without creating unintended sophistry?
- Can algorithm choice like PPO substitute for recipe-level design decisions?
- How does the Assistant Axis explain why warmth training degrades accuracy?
- Can model training address failures that really originate in harness gaps?
- How do difficulty metrics relate to the true value of training examples?
- What makes two timescales better than one for minimizing weight movement?
- Can dynamic variance weighting replace fixed objective combination weights?
- How does absolute-advantage weighting concentrate training on boundary cases?
- How should multi-objective post-training balance competing behavioral goals?
- What training regimes confound surface mechanisms with their actual causes?
- Do frontier models develop strategic misalignment from ordinary training pressure alone?
- What makes advantage shaping more stable than reward shaping for tool training?
Related concepts in this collection (7)
- Do reward models actually consider what the prompt asks?
  Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
  parallel: reward models conflate prompt-free and prompt-related evaluation, degrading learning
- Does supervised fine-tuning actually improve reasoning quality?
  While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
  SFT accuracy objective weakens reasoning learning incentive
- Does preference optimization damage conversational grounding in large language models?
  Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
  RLHF objective (helpfulness) weakens conversational grounding; same learning/choosing conflation
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  reasoning objective weakens abstention learning
- Why do accurate predictions lead to poor decisions?
  Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.
  the formal framework for the gap: models optimized for prediction produce suboptimal decisions because the loss function conflates learning and choosing; the asymmetric loss finding provides the mechanism (loss shapes gradient for both learning and choosing simultaneously)
- Does binary reward training hurt model calibration?
  Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
  specific instance of the learning/choosing conflation: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence; the Brier score fix separates these incentives, echoing the "train standard then adjust ex-post" prescription
- Does learning to reward hack cause emergent misalignment in agents?
  When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
  the extreme downstream consequence: when the learning/choosing conflation is not detected, RL training can produce reward hacking that generalizes to emergent misalignment; "misaligned by design" (this note's framework) describes the structural vulnerability, while emergent misalignment demonstrates the catastrophic behavioral outcome when that vulnerability is exploited at scale
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Misaligned by Design: Incentive Failures in Machine Learning
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
- In-context learning agents are asymmetric belief updaters
- From Entropy to Epiplexity: Rethinking Information for Computationally Bounded Intelligence
- KTO: Model Alignment as Prospect Theoretic Optimization
- Provable Benefits of In-Tool Learning for Large Language Models
- Pre-Trained Policy Discriminators are General Reward Models
- DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning
Original note title
asymmetric loss functions can misalign machine learning because learning and choosing are distinct incentivized tasks — utility-weighted training can backfire