INQUIRING LINE

Why does binary reward forcing degrade model calibration?

This explores why training models on pass/fail rewards (right vs. wrong, nothing in between) makes their confidence estimates unreliable — and what the corpus suggests as fixes.


Binary correctness rewards, where a model gets +1 for a right answer and 0 for a wrong one with no middle ground, push models toward overconfident guessing. The mechanism is almost embarrassingly simple once you see it: a binary reward never punishes a confident wrong answer any more than a hesitant one. A wrong guess and an abstention both score zero, but a guess occasionally scores a point, so the expected reward favors always guessing. The model learns to be confident everywhere, including where it shouldn't be, and calibration (the match between how sure a model sounds and how often it's right) collapses. "Does binary reward training hurt model calibration?" shows this isn't a quirk of one setup but a provable consequence, and that adding a Brier score (a 'proper scoring rule' that explicitly penalizes confident errors) as a second reward term fixes it without trading away accuracy.
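The incentive argument can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the combined reward is correctness minus the squared gap between stated confidence and correctness (one common way to add a Brier term; the exact form used in the cited note is not given here), and every function name is hypothetical.

```python
def binary_reward(correct: bool, confidence: float) -> float:
    """+1 for a right answer, 0 for a wrong one; stated confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed form: correctness minus the Brier penalty (confidence - outcome)^2."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(reward_fn, p_correct: float, confidence: float) -> float:
    """Average reward when the answer is right with probability p_correct."""
    return (p_correct * reward_fn(True, confidence)
            + (1.0 - p_correct) * reward_fn(False, confidence))


p_correct = 0.3                        # hypothetical true chance of being right
grid = [i / 100 for i in range(101)]   # candidate stated confidences 0.00 .. 1.00

# Under the binary reward, sounding 99% sure costs nothing extra.
print(expected_reward(binary_reward, p_correct, 0.99))
print(expected_reward(binary_reward, p_correct, 0.30))

# Under the Brier-augmented reward, the best stated confidence is the true one.
best_confidence = max(
    grid, key=lambda q: expected_reward(brier_augmented_reward, p_correct, q)
)
print(best_confidence)
```

Under this assumed form the expected reward simplifies to 2pq − q², which peaks at q = p and is worth p² there, so honest confidence is optimal and higher accuracy still pays more.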

The deeper issue is that a single binary scalar throws away information the model could have used. "Can scalar rewards capture all the information in agent feedback?" makes this general point: feedback naturally carries two separable things, how good an action was (evaluative) and how it should change (directive), and a scalar reward captures only the first. Binary reward is the most extreme compression of that scalar: it flattens the entire spectrum of 'how wrong, and in what way' into a single bit. Calibration is exactly the casualty, because calibration lives in the gradations the bit erased.

The corpus converges on a clear repair strategy: give the reward more than two states. "Can three-way rewards fix the accuracy versus abstention problem?" adds a third option, so the reward becomes correct (+1), hallucination (−1), or abstention (somewhere in between), which makes 'I don't know' a learnable move rather than a guaranteed loss, cutting hallucinations by nearly 29%. "Can model confidence work as a reward signal for reasoning?" goes further and uses the model's own answer-span confidence as the reward signal, reversing RLHF's calibration damage while sharpening reasoning, and notably without human labels. Both treat calibration not as something to bolt on afterward but as something the reward shape either preserves or destroys.
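Why a third state makes abstention learnable comes down to a break-even probability. The sketch below assumes an abstention reward of 0, which is only one possible value for "somewhere in between"; the function names and parameters are hypothetical.

```python
def expected_answer_reward(p_correct: float,
                           r_correct: float = 1.0,
                           r_wrong: float = -1.0) -> float:
    """Expected reward for committing to an answer that is right with p_correct."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong


def answer_threshold(r_abstain: float = 0.0,
                     r_correct: float = 1.0,
                     r_wrong: float = -1.0) -> float:
    """Probability of being right above which answering beats abstaining."""
    return (r_abstain - r_wrong) / (r_correct - r_wrong)


def should_answer(p_correct: float, r_abstain: float = 0.0) -> bool:
    """True when answering has higher expected reward than abstaining."""
    return expected_answer_reward(p_correct) > r_abstain


# Ternary reward (+1 / -1 / assumed 0): answer only when more likely right than wrong.
print(answer_threshold())
print(should_answer(0.3), should_answer(0.7))

# Binary reward (+1 / 0, abstain also 0): the threshold drops to zero, so any
# nonzero chance of being right makes guessing the better move.
print(answer_threshold(r_abstain=0.0, r_wrong=0.0))
```

Raising the assumed abstention reward toward +1 raises the threshold, and lowering it toward −1 pushes the policy back toward guessing.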

There's a useful tension worth pulling on here. "Does negative reinforcement alone outperform full reinforcement learning?" finds that training on negative samples alone, just suppressing wrong trajectories, often matches full RL while preserving the answer diversity that positive-only reinforcement crushes by piling probability mass onto a few favored answers. That probability-mass concentration is calibration collapse seen from another angle: the model becomes peaky and overconfident. So part of why binary reward hurts calibration is the same reason positive-only reinforcement narrows diversity: both reshape the output distribution toward overcommitment.

Worth knowing for the curious: this connects to a broader finding that reward-based RL mostly reshapes *which* answers a model commits to rather than expanding what it can do. "What does reward learning actually do to model reasoning?" and "Does RLVR actually expand what models can reason about?" show RLVR improves sampling efficiency by concentrating toward solutions already in the base model, which is precisely the dynamic that, taken too far with a crude binary signal, sacrifices calibrated uncertainty for confident commitment. The fix in every case is to make the reward carry more structure: a proper scoring rule, a third 'abstain' state, or a continuous confidence signal. Anything richer than a single bit will do.
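The Pass@k pattern behind that claim can be illustrated with a toy calculation. The per-problem success probabilities below are invented for illustration and come from no paper; the function computes expected pass@k under independent sampling, not the finite-sample estimator used in evaluations.

```python
def pass_at_k(per_problem_success: list[float], k: int) -> float:
    """Expected share of problems solved at least once in k independent samples."""
    solved = [1.0 - (1.0 - p) ** k for p in per_problem_success]
    return sum(solved) / len(solved)


# Hypothetical models over four problems (made-up numbers):
# the diffuse model occasionally solves everything, while the concentrated
# model is very reliable on two problems and never solves the other two.
diffuse_model = [0.10, 0.10, 0.10, 0.10]
concentrated_model = [0.90, 0.90, 0.00, 0.00]

for k in (1, 4, 16, 64):
    print(k,
          round(pass_at_k(diffuse_model, k), 3),
          round(pass_at_k(concentrated_model, k), 3))
```

With these numbers the concentrated model wins at k = 1 but plateaus at half the problems, while the diffuse model overtakes it as k grows, the same crossover shape the notes describe for RLVR versus base models.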


Sources (7 notes)

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about reward design and model calibration in LLM alignment. The question remains open: why and when do binary reward signals degrade calibration, and what structural changes to the reward actually fix it?

What a curated library found — and when (dated claims, not current truth): Findings span Sept 2024–Feb 2026.
• Binary correctness rewards (+1/0) provably push models toward overconfident guessing because confident errors score identically to hesitant ones; proper scoring rules (Brier score) repair this without accuracy loss (~2024–25).
• Ternary rewards (correct/hallucination/abstain) cut hallucinations ~29% by making 'I don't know' learnable rather than default-loss (~2025).
• Model's own answer-span confidence as reward signal reverses calibration damage *without human labels* while sharpening reasoning (~2025).
• Negative-only RL (suppressing wrong trajectories) often matches full RL while preserving answer diversity; positive-only RL concentrates probability mass into peaky, overconfident distributions, which is the same calibration collapse (~2026).
• RLVR mostly concentrates sampling toward solutions already in the base model rather than expanding reasoning boundaries; crude binary signals amplify this over-commitment (~2025–26).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (Apr 2025) — Does RL expand reasoning beyond the base model?
• arXiv:2506.01347 (Jun 2025) — The Surprising Effectiveness of Negative Reinforcement
• arXiv:2507.14843 (Jul 2025) — The Invisible Leash: Why RLVR May Not Escape Its Origin
• arXiv:2509.25760 (Sep 2025) — TruthRL: Incentivizing Truthful LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, ask: have *newer models* (GPT-4o, o1, Claude 4), *finer reward shaping* (token-level, rubric gates, reasoning reflection), *better orchestration* (memory, multi-agent feedback), or *stronger evals* (calibration metrics beyond ECE) since relaxed or overturned the calibration–binary-reward link? Separately flag which constraints still visibly hold and cite what resolved which.
(2) Surface the strongest *contradicting* work from the last ~6 months—papers showing binary rewards *do* preserve calibration under certain training regimes, or that the calibration problem is downstream of something else entirely (e.g., base model bias, not reward shape).
(3) Propose 2 research questions that assume the regime may have shifted: (a) Does token-level or rubric-gated reward assignment recover calibration *without* abandoning binary classification? (b) Can a model meta-learn when to trust its own confidence signal vs. defer to a learned calibration layer?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
