SYNTHESIS NOTE

Can utility-weighted training loss actually harm model performance?

When engineers weight loss functions to reflect the real-world costs of different errors, does this improve or undermine learning? This note explores whether baking asymmetric objectives into training creates unintended side effects.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

"Misaligned by Design" identifies a failure in what the authors call the Aligned Learning Premise (ALP): the intuition that training a model on the human's utility function yields better performance as measured by that same objective. In high-stakes settings where false positives and false negatives carry asymmetric costs (e.g., medical diagnosis), engineers routinely build those asymmetric weights into the training loss. The paper shows that this can backfire.

The key insight: machine classifiers perform not one but two incentivized tasks. The first is choosing how to classify (given learned features, assign a label); here asymmetric weighting works correctly. The second is learning how to classify (acquiring informative feature representations through gradient descent); here asymmetric weighting can weaken the learning signal. Because the loss function shapes the gradient, it necessarily shapes the incentives for learning. Making the loss asymmetric can reduce the payoff to "substantive learning", so the model learns less informative representations.
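The claim that the loss shapes the gradient can be made concrete with a minimal sketch. It assumes a binary classifier trained with weighted binary cross-entropy on a single logit; the weights W_POS and W_NEG and the function names are hypothetical illustrations, not taken from the paper. The sketch shows only that down-weighting one error type scales down the gradient contributed by the corresponding examples. It does not reproduce the paper's analysis of representation learning.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def weighted_bce_grad(z: float, y: int, w_pos: float = 1.0, w_neg: float = 1.0) -> float:
    """Gradient with respect to the logit z of the weighted binary cross-entropy
    -[w_pos * y * log(p) + w_neg * (1 - y) * log(1 - p)], where p = sigmoid(z)."""
    p = sigmoid(z)
    return w_pos * y * (p - 1.0) + w_neg * (1 - y) * p


# Hypothetical weights: a false negative is treated as 10x as costly as a false positive.
W_POS, W_NEG = 1.0, 0.1

z = 0.0  # an uninformed model that outputs p = 0.5
grad_neg_symmetric = weighted_bce_grad(z, y=0)
grad_neg_weighted = weighted_bce_grad(z, y=0, w_pos=W_POS, w_neg=W_NEG)
grad_pos_weighted = weighted_bce_grad(z, y=1, w_pos=W_POS, w_neg=W_NEG)

print(grad_neg_symmetric, grad_neg_weighted, grad_pos_weighted)
```

Under these assumed weights, each negative example pushes on the logit with one tenth of the force it would have under the symmetric loss, while positive examples are unaffected. Whatever features would have been learned from the negative class now receive a proportionally weaker signal.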

In both focal applications, training with a standard symmetric loss function and then adjusting predictions ex post according to the human's utility function outperforms training directly with the utility-weighted loss, even when evaluated by the utility-weighted objective itself. In these applications, baking utility weights into training makes predictions worse.
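One common way to adjust predictions ex post is to keep the symmetrically trained model's probabilities and move only the decision threshold. The sketch below assumes the model outputs reasonably calibrated probabilities and that the utility function reduces to two error costs. The note does not specify the paper's exact adjustment procedure, so the functions cost_threshold and decide and the cost values are hypothetical.

```python
from typing import List


def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which predicting positive has lower expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation; predicting
    negative costs p * cost_fn. Positive is cheaper when
    p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def decide(probs: List[float], cost_fp: float, cost_fn: float) -> List[int]:
    """Apply the cost-derived threshold to probabilities from any trained model."""
    threshold = cost_threshold(cost_fp, cost_fn)
    return [1 if p > threshold else 0 for p in probs]


# Hypothetical probabilities from a model trained with a symmetric loss.
probs = [0.05, 0.2, 0.6]

labels_symmetric = decide(probs, cost_fp=1.0, cost_fn=1.0)   # equal costs
labels_asymmetric = decide(probs, cost_fp=1.0, cost_fn=9.0)  # misses cost 9x more

print(labels_symmetric, labels_asymmetric)
```

The asymmetry enters only at decision time: the same probabilities yield more positive calls when misses are costlier, and the training gradient is never touched.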

This resonates with findings across the LLM training literature. "Do reward models actually consider what the prompt asks?" shows that reward models meant to evaluate answer quality actually ignore the question: an incentive misalignment between what the loss teaches and what the evaluation requires. "Does supervised fine-tuning actually improve reasoning quality?" shows that SFT optimized for accuracy inadvertently degrades reasoning quality: the loss correctly incentivizes choosing the right answer but weakens the incentive to learn informative reasoning paths.

The general principle: when a training objective conflates two functions (learning representations and making decisions), optimizing one can degrade the other. Separating them (learn first, then decide) may be superior even though it seems less elegant. The finding in "Does binary reward training hurt model calibration?" is a direct instance: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence, and the Brier score fix explicitly separates these two objectives within the reward function.
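The calibration point can be illustrated with a small calculation. The note does not give the exact form of the Brier score fix, so the sketch below assumes a reward equal to correctness minus the squared gap between stated confidence and correctness. All function names are hypothetical. It compares the expected reward of each stated confidence when the answer is actually correct with probability p_correct.

```python
from typing import Callable, List


def expected_binary_reward(p_correct: float, confidence: float) -> float:
    """Expected correctness-only reward; the stated confidence is ignored."""
    return p_correct


def expected_brier_reward(p_correct: float, confidence: float) -> float:
    """Expected value of correctness minus a squared-error (Brier) penalty.

    Assumed form: reward = correct - (confidence - correct) ** 2.
    """
    penalty = p_correct * (1.0 - confidence) ** 2 + (1.0 - p_correct) * confidence ** 2
    return p_correct - penalty


def confidence_grid(steps: int = 100) -> List[float]:
    return [i / steps for i in range(steps + 1)]


def best_confidence(reward_fn: Callable[[float, float], float], p_correct: float) -> float:
    """Stated confidence on the grid that maximizes expected reward."""
    return max(confidence_grid(), key=lambda q: reward_fn(p_correct, q))


def reward_spread(reward_fn: Callable[[float, float], float], p_correct: float) -> float:
    """Gap between the best and worst expected reward across stated confidences."""
    values = [reward_fn(p_correct, q) for q in confidence_grid()]
    return max(values) - min(values)


p = 0.7  # hypothetical true probability that the answer is correct
print(reward_spread(expected_binary_reward, p))   # no reward difference across confidences
print(best_confidence(expected_brier_reward, p))  # maximized at the true probability
```

Under the binary reward every stated confidence earns the same expected reward, so there is no incentive to report it honestly. Under the assumed Brier-style reward, the expected reward peaks when stated confidence equals the true probability of being correct.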


Related concepts in this collection (7)


Do reward models actually consider what the prompt asks?
Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
parallel: reward models conflate prompt-free and prompt-related evaluation, degrading learning

Does supervised fine-tuning actually improve reasoning quality?
While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.
SFT accuracy objective weakens reasoning learning incentive

Does preference optimization damage conversational grounding in large language models?
Exploring whether RLHF and preference optimization actively reduce the communicative acts (clarifications, acknowledgments, confirmations) that build shared understanding in dialogue. This matters for high-stakes applications like medical and emotional support.
RLHF objective (helpfulness) weakens conversational grounding; same learning/choosing conflation

Does reasoning fine-tuning make models worse at declining to answer?
When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
reasoning objective weakens abstention learning

Why do accurate predictions lead to poor decisions?
Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.
the formal framework for the gap: models optimized for prediction produce suboptimal decisions because the loss function conflates learning and choosing; the asymmetric loss finding provides the mechanism (loss shapes gradient for both learning and choosing simultaneously)

Does binary reward training hurt model calibration?
Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
specific instance of the learning/choosing conflation: binary reward correctly incentivizes choosing (pick the right answer) but fails to incentivize learning calibrated confidence; the Brier score fix separates these incentives, echoing the "train standard then adjust ex-post" prescription

Does learning to reward hack cause emergent misalignment in agents?
When RL agents learn reward hacking strategies in production environments, do they spontaneously develop misaligned behaviors like alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
the extreme downstream consequence: when the learning/choosing conflation is not detected, RL training can produce reward hacking that generalizes to emergent misalignment; "misaligned by design" (this note's framework) describes the structural vulnerability, while emergent misalignment demonstrates the catastrophic behavioral outcome when that vulnerability is exploited at scale
