INQUIRING LINE

Training an AI to prioritize your most important examples can quietly make it worse at understanding them.

What makes utility-weighted training backfire in machine learning systems?

This explores why training a model to optimize directly for the outcome you care about — weighting the loss toward high-stakes decisions, the reward you want, or the 'best' data — can quietly make the model worse at the very thing it's learning. The corpus points to a recurring mechanism: utility weighting collapses the rich signal a model needs to *learn* down to the narrow signal it needs to *decide*, and the two are not the same job.

The cleanest statement of this is the finding that asymmetric, utility-weighted loss functions strengthen decision-making while actively weakening representation learning: by shrinking the gradient signal for acquiring substantive features, they make the model a better chooser on top of a thinner understanding. The striking fix is to decouple the two: train with a plain symmetric loss, then adjust predictions for utility *afterward*. This beats baking utility into training, even when both are scored on the very same utility objective (see "Can utility-weighted training loss actually harm model performance?"). The same shape shows up in reinforcement learning with binary correctness rewards: because a binary reward penalizes a confident wrong answer no more than a hesitant one, it teaches the model to guess loudly, wrecking calibration. The reported fix is to add a second reward term (the Brier score) that restores the penalty the utility signal stripped out (see "Does binary reward training hurt model calibration?").
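The "train symmetric, adjust afterward" idea can be illustrated with a minimal sketch. The note does not say how the post-hoc adjustment is done, so this assumes the standard cost-sensitive decision rule: a model trained with an ordinary symmetric loss outputs a calibrated probability, and utility enters only through the decision threshold. The function names and cost values are hypothetical.

```python
def utility_threshold(cost_fp: float, cost_fn: float) -> float:
    """Cutoff on P(positive) that minimizes expected cost.

    Acting costs (1 - p) * cost_fp in expectation, not acting costs
    p * cost_fn, so acting is cheaper once p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(p: float, act: bool, cost_fp: float, cost_fn: float) -> float:
    """Expected cost of a decision given the predicted probability p."""
    return (1.0 - p) * cost_fp if act else p * cost_fn


def decide(probs, cost_fp: float, cost_fn: float):
    """Apply utility only at decision time; the model itself is untouched."""
    cutoff = utility_threshold(cost_fp, cost_fn)
    return [p >= cutoff for p in probs]


# Assumed outputs of a model trained with a plain symmetric loss.
probs = [0.05, 0.20, 0.60]

# Assumed utilities: a miss costs nine times as much as a false alarm.
decisions = decide(probs, cost_fp=1.0, cost_fn=9.0)
```

Because the costs appear only in `decide`, changing the utility profile means recomputing one threshold rather than retraining, and the gradient signal that shaped the model never saw the asymmetry.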

The backfire gets worse when the weighting amplifies rare, lucky events. Training on near-impossible problems in reinforcement learning with verifiable rewards (RLVR) sounds like high-value practice, but group-relative normalization treats a stray accidental success as a high-advantage trajectory and reinforces it. The model then learns answer-repetition and computation-skipping shortcuts that contaminate skills it already had (see "Do overly hard RLVR samples actually harm model capabilities?"). A subtler cousin: positive-only reinforcement concentrates probability mass onto whatever currently scores well, degrading diversity and Pass@k at higher k, whereas suppressing the wrong answers (negative reinforcement) preserves the spread and often matches full RL (see "Does negative reinforcement alone outperform full reinforcement learning?"). Weighting toward the winner, it turns out, is exactly what narrows the model.
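A small numeric sketch shows why a stray success on a near-impossible problem gets so much weight. It assumes the common GRPO-style form of group-relative normalization, (reward minus group mean) divided by group standard deviation; the group size, the use of population standard deviation, and the `eps` constant are illustrative choices, not details taken from the note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps: float = 1e-6):
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible problem: 1 lucky success out of 16 rollouts.
hard_group = [0] * 15 + [1]

# Moderately hard problem: 8 successes out of 16 rollouts.
moderate_group = [0] * 8 + [1] * 8

lucky_advantage = group_relative_advantages(hard_group)[-1]
typical_advantage = group_relative_advantages(moderate_group)[-1]
```

With binary rewards and success rate p, a successful rollout's advantage works out to roughly sqrt((1 - p) / p). It therefore grows as successes get rarer: the single lucky rollout in `hard_group` receives several times the advantage of a success in `moderate_group`, whatever reasoning (or shortcut) produced it.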

Utility weighting also backfires when 'high utility' is judged from the outside rather than relative to the learner. Teacher-refined instruction data that is objectively higher quality still *degrades* a student model when it sits beyond the student's learning frontier. The fix is to let the student filter for compatibility with its own profile rather than swallow everything labeled good (see "Does teacher-refined data always improve student model performance?"). And when the utility signal is read off the system's own past behavior, it closes a loop: YouTube's multi-objective ranker has to explicitly model selection bias, because without it the model converges on degenerate equilibria that just amplify its own prior decisions (see "Why do ranking systems need to model selection bias explicitly?").

The through-line the reader may not have expected: the most reliable cure across all these cases is *not* a better utility weight but a structural separation. Keep the learning signal honest and apply the utility pressure somewhere else. Decode-time proxy tuning preserves pretrained knowledge precisely by never touching the weights that store it (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"), and difficulty-based data pruning shows weighting *can* help, but only when it removes redundancy rather than chasing the objective directly (see "Can we prune training data without hurting model performance?"). Utility weighting backfires when you let the thing you want to optimize stand in for the thing you need to learn.
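Decode-time proxy tuning is a concrete instance of applying pressure "somewhere else". The sketch below assumes the usual logit-arithmetic formulation: the large base model's next-token logits are shifted by the difference between a small tuned model and its small untuned counterpart, and no weights of the large model change. The three-token vocabulary and all logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_large, tuned_small, base_small):
    """Shift the large model's logits by what tuning changed in the small one."""
    shifted = [
        big + (tuned - untuned)
        for big, tuned, untuned in zip(base_large, tuned_small, base_small)
    ]
    return softmax(shifted)


# Hypothetical next-token logits over a 3-token vocabulary.
base_large = [2.0, 1.0, 0.0]   # large pretrained model, weights untouched
base_small = [0.0, 0.0, 0.0]   # small model before tuning
tuned_small = [0.0, 3.0, 0.0]  # small model after tuning prefers token 1

steered = proxy_tuned_distribution(base_large, tuned_small, base_small)
unchanged = proxy_tuned_distribution(base_large, base_small, base_small)
```

When the small model's tuning changes nothing, the offset is zero and the large model's distribution passes through unmodified, which is the sense in which the stored knowledge is never overwritten.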

Sources 8 notes

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
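The effect of the second term can be checked numerically. This sketch assumes the combined reward takes the form correctness minus the squared gap between stated confidence and correctness (the Brier penalty); the note does not give the formula, so treat the exact form and the function names as assumptions.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: confidence never enters."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_calibrated_reward(true_accuracy: float, confidence: float) -> float:
    """Average reward for a model that is right with probability true_accuracy."""
    return (
        true_accuracy * calibrated_reward(True, confidence)
        + (1.0 - true_accuracy) * calibrated_reward(False, confidence)
    )


# Search a grid of stated confidences for a model that is right 70% of the time.
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_calibrated_reward(0.7, q))
```

Under `binary_reward`, a wrong answer scores the same whether the model was sure or unsure. Under `calibrated_reward`, a wrong answer at confidence 0.9 is penalized far more than one at 0.1, and the expected reward peaks when the stated confidence equals the true accuracy.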

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.
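As a rough illustration of difficulty-based pruning, the sketch below scores each example with EL2N, taken here to be the L2 norm of the gap between predicted class probabilities and the one-hot label, then drops the lowest-scoring (easiest) examples. The example records, probabilities, and keep fraction are hypothetical, and a real pipeline would typically average scores over several early-training checkpoints rather than use a single prediction.

```python
import math


def el2n_score(probs, label: int) -> float:
    """L2 norm of (predicted probabilities - one-hot label)."""
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def prune_easiest(examples, keep_fraction: float):
    """Keep the hardest keep_fraction of examples, ranked by EL2N."""
    ranked = sorted(
        examples,
        key=lambda ex: el2n_score(ex["probs"], ex["label"]),
        reverse=True,  # hardest first
    )
    n_keep = max(1, round(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Hypothetical predictions from an early-training model on four examples.
examples = [
    {"id": "a", "probs": [0.90, 0.10], "label": 0},
    {"id": "b", "probs": [0.60, 0.40], "label": 0},
    {"id": "c", "probs": [0.20, 0.80], "label": 0},
    {"id": "d", "probs": [0.95, 0.05], "label": 0},
]

kept_ids = [ex["id"] for ex in prune_easiest(examples, keep_fraction=0.5)]
```

The score ranks examples by how much the model still has to learn from them, not by how much they matter to the downstream objective, which is why pruning of this kind removes redundancy instead of reweighting toward utility.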

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (2.41 match)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (1.68 match)
Reinforcement Learning with Rubric Anchors (1.62 match)
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts (1.60 match)
Foundations of Large Language Models (1.60 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.59 match)
Misaligned by Design: Incentive Failures in Machine Learning (0.88 match)
Beyond neural scaling laws: beating power law scaling via data pruning (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing claims about utility-weighted training backfire. The question: under what conditions does optimizing directly for a desired outcome degrade model capability?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. A library of ~15 papers identified:
  • Asymmetric loss functions strengthen decision-making but weaken representation learning by shrinking gradient signal for feature acquisition; decoupling training (symmetric loss) from prediction adjustment (utility applied post-hoc) restores performance (~2024).
  • Binary correctness rewards teach confident wrong answers and wreck calibration; adding a Brier-score penalty term restores the signal utility stripped out (~2025).
  • Group-relative normalization on near-impossible RLVR tasks treats stray accidental successes as high-advantage trajectories, inducing answer-repetition shortcuts that contaminate existing skills (~2026).
  • Positive-only reinforcement concentrates probability mass onto current winners, degrading diversity and higher-k performance; suppressing wrong answers (negative RL) preserves spread and often matches full RL (~2025).
  • Teacher-refined instruction data degrades student models when it sits beyond the student's learning frontier; student-side filtering for compatibility outperforms swallowing all "good" data (~2024–2025).
  • Structural separation — keeping learning signal honest and applying utility pressure elsewhere (e.g., decode-time proxy tuning, difficulty-based pruning) — outperforms direct objective baking (~2024–2026).

Anchor papers (verify; mind their dates):
  • arXiv:2409.15360 (2024-09) Reward-Robust RLHF in LLMs
  • arXiv:2506.01347 (2025-06) The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
  • arXiv:2605.28388 (2026-05) Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs
  • arXiv:2511.07699 (2025-11) Misaligned by Design: Incentive Failures in Machine Learning

Your task:
  (1) RE-TEST THE CORE TENSION. For each finding (asymmetric loss, binary rewards, group-relative RLVR, positive-only RL, teacher refinement, structural separation), judge whether post-2026 models, training methods, or evaluation frameworks have RELAXED or OVERTURNED the constraint. Separate the durable question (e.g., "does utility weighting collapse learning signal?") from the perishable limitation (e.g., "binary rewards wreck calibration"). Flag where constraints still hold and cite what has shifted them.
  (2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown utility weighting *can* co-optimize learning and decision-making without structural separation? Does any recent work restore confidence in direct objective baking?
  (3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Can modern scaling or architectural changes (attention, memory, MoE) make utility weighting benign?" or "Does fine-tuning depth (shallow vs. deep LoRA, full WT) change the learning–decision trade-off?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
