SYNTHESIS NOTE

Does training objective determine which direction models fail at abstention?

Calibration failures might not be universal—different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.

Synthesis note · 2026-02-23 · sourced from Alignment

LLM abstention calibration fails in both directions depending on the training objective, not the model's general capability:

Reasoning-trained models under-abstain. RL/RLHF training for reasoning optimizes answer generation. Abstention is implicitly penalized: "I don't know" receives no reward, while a guess has at least some chance of earning one. As the linked note "Does reasoning fine-tuning make models worse at declining to answer?" argues, the result is overconfident models that answer when they shouldn't.

Safety-trained models over-abstain. RLHF with safety emphasis raises uncertainty thresholds too high. Models refuse benign prompts or decline complex but answerable open-ended tasks. TrustLLM demonstrates safety-training-driven over-refusal on completely safe questions.

Base models split by domain complexity. In simple templated tasks, base models calibrate reasonably. In complex open-ended domains (legal reasoning, medical diagnosis), base models set their uncertainty threshold too conservatively, under-answering questions they could handle.
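The two failure directions above can be measured separately rather than folded into one calibration score. The following is a minimal sketch, assuming an evaluation set where every item is labeled as answerable or not and every model response is labeled as an abstention or not; the `Record` type and its field names are hypothetical and do not come from TrustLLM or any other benchmark.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Record:
    answerable: bool  # hypothetical label: the question has a supportable answer
    abstained: bool   # hypothetical label: the model declined to answer


def abstention_rates(records: Iterable[Record]) -> Tuple[float, float]:
    """Return (over_abstention_rate, under_abstention_rate).

    over-abstention:  share of answerable items the model refused.
    under-abstention: share of unanswerable items the model answered anyway.
    """
    records = list(records)
    answerable = [r for r in records if r.answerable]
    unanswerable = [r for r in records if not r.answerable]

    over = (
        sum(r.abstained for r in answerable) / len(answerable)
        if answerable else 0.0
    )
    under = (
        sum(not r.abstained for r in unanswerable) / len(unanswerable)
        if unanswerable else 0.0
    )
    return over, under
```

Reporting the two rates side by side makes the failure signature visible: under the note's claim, a reasoning-tuned model would show a high second number and a safety-tuned model a high first number.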

The implication: "calibration" is not a single axis that can be fixed by one technique. The training objective creates a characteristic failure signature. A model tuned for both reasoning and safety faces contradictory calibration pressures: one pushes toward answering, the other toward refusing. This may explain why reasoning fine-tuning degrades abstention: it actively counteracts the safety training's conservative bias. A potential resolution is explored in the linked note "Does binary reward training hurt model calibration?", which suggests the conflict can be addressed at the reward design level.
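To see why reward design is a plausible lever, it helps to work through the expected reward of answering versus abstaining. The sketch below is a generic illustration, not the method of the linked note: it assumes a correct answer earns 1, a wrong answer costs `wrong_penalty`, and abstaining earns `abstain_reward`. All three names and the reward values are assumptions made for the example.

```python
def expected_reward_answer(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected reward of answering when the model is right with probability p_correct."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(
    p_correct: float, wrong_penalty: float = 0.0, abstain_reward: float = 0.0
) -> bool:
    """True when answering has strictly higher expected reward than abstaining."""
    return expected_reward_answer(p_correct, wrong_penalty) > abstain_reward


def break_even_confidence(wrong_penalty: float, abstain_reward: float = 0.0) -> float:
    """Confidence at which answering and abstaining pay the same.

    Solves p - (1 - p) * wrong_penalty = abstain_reward for p.
    """
    return (abstain_reward + wrong_penalty) / (1.0 + wrong_penalty)
```

Under a purely binary reward (`wrong_penalty=0`, `abstain_reward=0`) the break-even confidence is 0, so any nonzero chance of being right favors answering, which matches the under-abstention pattern described for reasoning training. Raising `wrong_penalty` moves the break-even point up (0.5 at a penalty of 1, 0.75 at a penalty of 3), and pushing it too far would reproduce the over-abstention pattern instead.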

For post-writing: connects to "the critical thinking problem" (reasoning training optimizes narrow thinking while degrading meta-cognitive judgment about when not to think) and the broader theme that training optimizes a target metric while degrading adjacent capabilities.

Original note title

training objective determines abstention direction — reasoning training under-abstains while safety training over-abstains