INQUIRING LINE

An AI trained for safety refuses too much; one trained for reasoning answers too eagerly — and neither is a coincidence.

How do different training objectives shift whether models over-predict or under-predict?

This explores how the choice of training objective — what gets rewarded versus penalized — systematically tilts a model toward answering too eagerly (over-predicting) or holding back too much (under-predicting), rather than treating that bias as a random quirk.

The corpus's sharpest claim is that the direction of the error is a fingerprint of which objective dominated training, not a single dial you tune. The clearest case: models trained for reasoning learn to *over-answer*, because abstaining ('I don't know') is never rewarded, while models trained for safety learn to *over-abstain*, refusing even harmless questions (see 'Does training objective determine which direction models fail at abstention?'). Same architecture, opposite failure, and the deciding factor is what the reward signal valued. That reframes calibration from 'a bug to fix' into 'a characteristic signature you inherit from your training recipe.'

The mechanism behind over-prediction shows up most cleanly in reward design. Binary correctness rewards (right answer good, wrong answer bad) quietly teach the model to guess confidently, because a confident wrong answer costs exactly the same as a hesitant one. There's no penalty for bravado. Adding the Brier score as a second reward term, which penalizes the squared gap between stated confidence and the actual outcome, provably restores calibration without sacrificing accuracy (see 'Does binary reward training hurt model calibration?'). So the over-prediction isn't the model being reckless; it's the objective failing to price uncertainty.
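A minimal sketch of why the extra term matters, assuming the combined reward is correctness minus the squared gap between stated confidence and outcome (the exact form used in the underlying work is not given here, so treat this as an illustration). All function names are hypothetical.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """Reward depends only on correctness; stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed form: correctness minus the Brier penalty (confidence - outcome)^2."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


def best_confidence(reward_fn, p_correct: float, grid: int = 101) -> float:
    """Stated confidence (on a 0..1 grid) that maximizes expected reward."""
    candidates = [i / (grid - 1) for i in range(grid)]
    return max(candidates, key=lambda q: expected_reward(reward_fn, p_correct, q))


# A model that is right only 30% of the time:
p = 0.3
flat_low = expected_reward(binary_reward, p, 0.0)    # same value...
flat_high = expected_reward(binary_reward, p, 1.0)   # ...whatever confidence is stated
honest = best_confidence(brier_augmented_reward, p)  # peaks at the true hit rate
```

Under the binary reward every confidence level earns the same expected reward, so nothing anchors what the model claims. With the Brier term, expected reward peaks when stated confidence equals the true chance of being right.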

Push the reward in the wrong direction and the bias gets worse rather than better. Training on problems that are nearly impossible for the model warps it toward degenerate shortcuts (repeating answers, skipping computation), because group-relative normalization treats the rare accidental success as a high-advantage trajectory worth reinforcing (see 'Do overly hard RLVR samples actually harm model capabilities?'). And there's a subtler twist for decision-focused training: weighting the loss by how costly a mistake is (asymmetric loss) does sharpen the model's *choices*, but it weakens the underlying *learning* by muting the gradient signal for genuine feature acquisition. The counterintuitive fix is to train with a plain symmetric objective and then adjust the predictions afterward: separating 'learn the world' from 'decide what to do' beats baking the bias into the weights (see 'Can utility-weighted training loss actually harm model performance?').
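One way to read 'adjust the predictions afterward' is standard decision-theoretic thresholding: keep the probabilities from a symmetrically trained model and apply the cost asymmetry only when choosing an action. The sketch below assumes a binary decision with hypothetical costs `cost_false_positive` and `cost_false_negative`; it illustrates the separation, not the specific procedure in the cited note.

```python
def cost_optimal_threshold(cost_false_positive: float,
                           cost_false_negative: float) -> float:
    """Probability at or above which acting has the lower expected cost."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def expected_cost(prob_positive: float, act: bool,
                  cost_false_positive: float,
                  cost_false_negative: float) -> float:
    """Expected cost of acting (risking a false positive) or not (a false negative)."""
    if act:
        return (1.0 - prob_positive) * cost_false_positive
    return prob_positive * cost_false_negative


def decide(prob_positive: float,
           cost_false_positive: float,
           cost_false_negative: float) -> bool:
    """Apply the cost asymmetry at decision time, leaving the probability untouched."""
    threshold = cost_optimal_threshold(cost_false_positive, cost_false_negative)
    return prob_positive >= threshold


# Probabilities from a model trained with a plain symmetric loss (made-up values).
probabilities = [0.05, 0.2, 0.6]
# Missing a positive is assumed to be 9x as costly as a false alarm.
decisions = [decide(p, 1.0, 9.0) for p in probabilities]
```

The model's probabilities stay calibrated to the data, and changing the costs later only moves the threshold; no retraining is needed.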

That 'adjust afterward' theme runs deeper than calibration. Several notes converge on the idea that *how aggressively* an objective rewrites the base model determines how distorted its predictions become. RL post-training tends to collapse onto a single dominant output format from pretraining within the first epoch, suppressing alternatives the model could have used (see 'Does RL training collapse format diversity in pretrained models?'). Direct fine-tuning corrupts knowledge stored in lower layers, whereas decoding-time proxy tuning closes 88-91% of the alignment gap while leaving those weights, and the knowledge they store, intact (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?'). Staying close to the base distribution (low KL drift) even preserves the model's ability to keep learning later (see 'Does staying close to the base model preserve learning ability?'). The pattern: objectives that yank the model far from its starting point trade prediction quality and plasticity for whatever they're optimizing.
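The decoding-time idea can be sketched as logit arithmetic, following the formulation in 'Tuning Language Models by Proxy': the large base model's next-token logits are shifted by the difference between a small tuned model (the expert) and its untuned counterpart (the anti-expert). The toy logits and the KL helper below are illustrative assumptions; real use would take logits from three models that share a vocabulary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by (expert - anti-expert); base weights are never modified."""
    shifted = [b + (e - a)
               for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)]
    return softmax(shifted)


def kl_divergence(p, q):
    """KL(p || q) in nats, used here as a rough measure of drift from the base model."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy three-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]
expert = [1.0, 0.0, 0.0]      # small tuned model favors token 0
antiexpert = [0.0, 0.0, 0.0]  # small untuned model has no preference

base_dist = softmax(base)
tuned_dist = proxy_tuned_distribution(base, expert, antiexpert)
drift = kl_divergence(tuned_dist, base_dist)
```

When the expert and anti-expert agree, the shift is zero and the base distribution is returned unchanged, so the drift is controlled entirely by how much tuning moved the small model.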

The thing you didn't know you wanted to know: over- and under-prediction aren't really about the model being overconfident or timid — they're about what the objective forgot to penalize. A reasoning reward forgot to value silence; a binary reward forgot to value calibrated doubt; a utility-weighted loss forgot that good decisions need good representations underneath. Once you see calibration failures as the negative space of the reward function, the fix is usually to add the missing term (Brier score) or move the adjustment out of training entirely (proxy tuning, post-hoc correction) rather than to train harder on the same lopsided signal.

Sources 7 notes

Does training objective determine which direction models fail at abstention?

Reasoning-trained models under-abstain and over-answer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models (2.40 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.69 match)
Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs (1.68 match)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (1.67 match)
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (1.64 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.63 match)
Misaligned by Design: Incentive Failures in Machine Learning (0.88 match)
Tuning Language Models by Proxy (0.85 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **Do different training objectives deterministically shift models toward over- or under-prediction, or is calibration error independent of objective choice?** This remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable scaffolding.

• Reasoning-tuned models over-answer (suppress abstention); safety-tuned models over-abstain. Same architecture, opposite bias — determined by what the reward signal valued, not model capacity (2025–2026).
• Binary correctness rewards provably degrade calibration; adding Brier score or proper scoring rules restores it without sacrificing accuracy. Over-prediction is an artifact of the reward, not recklessness (2024–2025).
• RL post-training collapses onto a single dominant pretraining format within the first epoch, suppressing alternatives and distorting predictions (2025).
• Direct fine-tuning corrupts lower-layer knowledge; decoding-time proxy tuning closes alignment gaps while preserving weights and plasticity. Low KL drift from the base model preserves learning capacity (2025–2026).
• Asymmetric loss functions (decision-focused training) sharpen *choices* but weaken *learning* by muting gradient signals. Training with symmetric objectives then adjusting post-hoc separates "learn the world" from "decide what to do" (2025).

Anchor papers (verify; mind their dates):
• arXiv:2504.07912 (2025) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2506.09038 (2025) — AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
• arXiv:2511.07699 (2025) — Misaligned by Design: Incentive Failures in Machine Learning
• arXiv:2605.28388 (2026) — Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above — reasoning over-answer, Brier score fix, RL collapse, proxy tuning — judge whether newer models, training methods (e.g., DPO, IPO variants), decoding strategies, or evals have since *relaxed* or *overturned* it. Separate the durable question (objective choice likely still shapes bias direction) from the perishable claim (e.g., maybe Brier scoring is now standard, or RL collapse is less severe with different schedulers). Where a constraint still holds, say so plainly; where it's dissolved, cite what dissolved it.

(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months**: any paper showing calibration errors *independent* of objective, or a unified framework that dissolves the tension between reasoning-driven over-answering and safety-driven over-abstention.

(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "If modern RL schedulers prevent collapse, do we still need proxy tuning?" or "Can a single reward term (e.g., a refined proper scoring rule) address all four failure modes simultaneously?"

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
