SYNTHESIS NOTE

Does training objective determine which direction models fail at abstention?

Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.

Synthesis note · 2026-02-23 · sourced from Alignment

LLM abstention calibration fails in both directions depending on the training objective, not the model's general capability:

Reasoning-trained models under-abstain. RL/RLHF training for reasoning optimizes answer generation. Abstention is implicitly penalized because "I don't know" receives no reward. As documented in "Does reasoning fine-tuning make models worse at declining to answer?", the result is overconfident models that answer when they shouldn't.

Safety-trained models over-abstain. RLHF with safety emphasis raises uncertainty thresholds too high. Models refuse benign prompts or decline complex but answerable open-ended tasks. TrustLLM demonstrates safety-training-driven over-refusal on completely safe questions.

Base models split by domain complexity. In simple templated tasks, base models calibrate reasonably. In complex open-ended domains (legal reasoning, medical diagnosis), base models set their uncertainty threshold too conservatively, under-answering questions they could handle.
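The two failure directions can be measured separately rather than folded into one calibration score. A minimal sketch, assuming a hypothetical evaluation set in which each item is labeled answerable or not and each model response is recorded as an answer or an abstention. The names `Item` and `abstention_error_rates` and the sample data are illustrative and do not come from TrustLLM or any other cited benchmark:

```python
from dataclasses import dataclass


@dataclass
class Item:
    answerable: bool  # hypothetical ground-truth label: should the model answer?
    abstained: bool   # did the model decline to answer?


def abstention_error_rates(items):
    """Return (under_abstention_rate, over_abstention_rate).

    under-abstention: the model answered an unanswerable question
    over-abstention:  the model abstained on an answerable question
    """
    unanswerable = [i for i in items if not i.answerable]
    answerable = [i for i in items if i.answerable]
    under = (
        sum(not i.abstained for i in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    over = (
        sum(i.abstained for i in answerable) / len(answerable)
        if answerable else 0.0
    )
    return under, over


# Made-up data for illustration only.
items = [
    Item(answerable=True, abstained=False),
    Item(answerable=True, abstained=True),
    Item(answerable=False, abstained=False),
    Item(answerable=False, abstained=False),
    Item(answerable=False, abstained=True),
    Item(answerable=True, abstained=False),
]
under, over = abstention_error_rates(items)
```

Reporting the two rates side by side makes the characteristic signature visible: a reasoning-tuned model would be expected to show a high first number, a safety-tuned model a high second number.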

The implication: "calibration" is not a single axis that can be fixed by one technique. The training objective creates a characteristic failure signature. A model tuned for both reasoning and safety faces contradictory calibration pressures: one pushes toward answering, the other toward refusing. This may explain why reasoning fine-tuning degrades abstention: it actively counteracts the safety training's conservative bias. A potential resolution is explored in "Does binary reward training hurt model calibration?", which suggests the axis conflict can be addressed at the reward design level.
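One way to see how reward design could set the abstention direction is to compare the expected reward of answering with the reward for abstaining. The following is a sketch under assumed reward values, not the scheme used in any linked note: the binary scheme gives 1 for a correct answer and 0 for both wrong answers and abstentions, and the hypothetical ternary scheme gives +1 for correct, 0 for abstaining, and -1 for wrong. The function names are illustrative.

```python
def should_answer(p_correct, r_correct, r_wrong, r_abstain):
    """True if answering has strictly higher expected reward than abstaining."""
    expected_answer = p_correct * r_correct + (1.0 - p_correct) * r_wrong
    return expected_answer > r_abstain


def answer_threshold(r_correct, r_wrong, r_abstain):
    """Confidence above which answering beats abstaining.

    Solves p * r_correct + (1 - p) * r_wrong = r_abstain for p.
    Assumes r_correct > r_wrong.
    """
    return (r_abstain - r_wrong) / (r_correct - r_wrong)


# Assumed reward values, chosen only for illustration.
binary = dict(r_correct=1.0, r_wrong=0.0, r_abstain=0.0)
ternary = dict(r_correct=1.0, r_wrong=-1.0, r_abstain=0.0)

binary_threshold = answer_threshold(**binary)    # 0.0: guessing is never worse
ternary_threshold = answer_threshold(**ternary)  # 0.5: answer only if likely right
```

Under the assumed binary scheme the threshold is 0, so any nonzero chance of being right makes guessing at least as good as abstaining, which is the under-abstention pressure. Making the wrong-answer penalty much larger than the correct-answer reward moves the threshold toward 1, which is the over-abstention pressure. Both failure directions then appear as two ends of one reward-design dial.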

For post-writing: connects to "the critical thinking problem" (reasoning training optimizes narrow thinking while degrading meta-cognitive judgment about when not to think) and the broader theme that training optimizes a target metric while degrading adjacent capabilities.

Inquiring lines that read this note 11

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can model confidence signals reliably improve reasoning quality and calibration?

Can language model RL training avoid reward hacking and misalignment?

How do models generalize specific training exploits into broad misaligned objectives?

Does alignment training create blind spots in detecting genuine safety threats?

What role does terminal goal guarding play in model misalignment?

How do self-generated feedback mechanisms enable effective model learning?

How do different training objectives shift whether models over-predict or under-predict?

What prevents language models from reliably adopting diverse personas?

Do training objectives directly determine the ENFJ default across models?

Can AI systems balance emotional competence with factual reliability?

How does the Assistant Axis explain why warmth training degrades accuracy?

What constrains reinforcement learning's ability to expand model reasoning?

What failure modes do imitation and outcome methods each address?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

What trade-offs emerge between training objectives and model reliability?

Related concepts in this collection 5

Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
the primary evidence for reasoning-trained under-abstention
Does binary reward training hurt model calibration? Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
potential resolution via reward design
Can models identify what information they actually need? When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.
under-abstention is especially damaging when tasks are underspecified: models trained to always answer cannot identify what information is missing, creating a compound failure of forced answering on incomplete inputs
Does AI refusal on politics signal ethical restraint or capability limits? When AI models refuse to discuss political topics, is that a sign of principled safety training or a sign they lack the internal concepts to engage? Research on political feature representation suggests the answer may surprise you.
identifies a third mechanism for over-abstention distinct from safety training: models refuse politically complex topics not because of safety constraints but because they lack sufficient internal representation to engage; safety-trained over-abstention (this note) and representation-poverty refusal (that note) produce the same surface behavior from different causes
Can three-way rewards fix the accuracy versus abstention problem? Standard RL forces models to choose between accuracy and honesty about uncertainty. Could treating correct answers, hallucinations, and abstentions as distinct reward outcomes let models learn when to say 'I don't know'?
ternary reward is the direct solution to the bidirectional abstention problem: intermediate reward for abstention gives models a learnable signal that resolves both under-abstention (reasoning) and over-abstention (safety) at the reward design level

Original note title

training objective determines abstention direction — reasoning training under-abstains while safety training over-abstains
