INQUIRING LINE


Pass/fail grading trains AI to be confidently wrong — a lucky guess earns the same reward as a careful one.

Can log-likelihood loss combined with binary rewards achieve calibration?

This explores whether the standard training recipe — log-likelihood loss plus simple right/wrong (binary) rewards — can produce a model that *knows what it knows*, i.e. whose stated confidence matches its actual accuracy.

The corpus's sharpest answer is no, not on its own. Binary rewards have a built-in flaw: they only ask "was the final answer right?", so they never punish a model for being confidently wrong. Training therefore pushes models toward high-confidence guessing, because a lucky confident guess pays the same as a careful one. The note "Does binary reward training hurt model calibration?" shows this degradation is provable, not incidental, and that the fix is to add a *proper scoring rule* (the Brier score) as a second reward term, which mathematically lets accuracy and calibration improve together rather than trading off.
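To make the incentive concrete, here is a minimal sketch of the expected reward a model earns for *stating* a confidence `q` when its true chance of being right is `p`. The reward form (correctness minus the squared error between stated confidence and outcome) is an assumption chosen for illustration, and the function names are hypothetical; the source note only says that a Brier term is added as a second reward.

```python
def expected_binary_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when only correctness is scored; stated confidence is ignored."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0


def expected_brier_augmented_reward(p_correct: float, confidence: float) -> float:
    """Expected correctness reward minus the expected Brier penalty (confidence - outcome)^2.

    Assumed reward form for illustration only.
    """
    penalty_if_right = (confidence - 1.0) ** 2
    penalty_if_wrong = (confidence - 0.0) ** 2
    expected_penalty = p_correct * penalty_if_right + (1.0 - p_correct) * penalty_if_wrong
    return p_correct - expected_penalty


def best_confidence(reward_fn, p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: reward_fn(p_correct, q))


# With a true accuracy of 0.3, the binary reward is indifferent to confidence,
# while the Brier-augmented reward peaks when stated confidence equals 0.3.
print(expected_binary_reward(0.3, 0.99) == expected_binary_reward(0.3, 0.30))
print(best_confidence(expected_brier_augmented_reward, 0.3))
```

Under the binary reward every stated confidence earns the same expected payoff, so nothing discourages overclaiming. Under the augmented reward the expected payoff is highest when the stated confidence matches the true accuracy, which is the defining property of a proper scoring rule.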

The deeper point is that the reward signal, not the likelihood loss, is doing the damage. A loss function inherits whatever objective you point it at. The note "Can utility-weighted training loss actually harm model performance?" makes this vivid from the opposite direction: when you bend the loss toward a decision-oriented (utility-weighted) objective, you sharpen *choosing* but weaken the model's underlying representation, because you have reduced the gradient signal that teaches it real structure. The lesson cuts both ways: coarse, outcome-only signals (a binary reward, an asymmetric loss) optimize the thing they measure and quietly corrupt the things they don't, with calibration the usual casualty.
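The alternative that the source note favors, training with a symmetric loss and adjusting predictions afterwards, can be sketched with a standard cost-sensitive decision threshold. This is an illustrative assumption: the note does not specify its post-hoc adjustment, and the function names and costs below are hypothetical.

```python
def decision_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Probability above which acting 'positive' has lower expected cost.

    Acting positive costs (1 - p) * cost_fp in expectation; acting negative
    costs p * cost_fn. Acting positive is cheaper when p exceeds this ratio.
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(prob_positive: float, cost_fp: float, cost_fn: float) -> bool:
    """Apply the utility-aware threshold to a probability from a symmetrically trained model."""
    return prob_positive > decision_threshold(cost_fp, cost_fn)


# Hypothetical costs: a miss is nine times worse than a false alarm.
print(decision_threshold(1.0, 9.0))   # threshold drops well below 0.5
print(decide(0.2, 1.0, 9.0))          # act, despite only 20% probability
print(decide(0.2, 1.0, 1.0))          # with equal costs, do not act
```

The design point is that the asymmetry lives in the decision rule rather than in the training loss, so the model's probabilities are learned from an undistorted objective and the utility can change without retraining.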

There's a more elegant route than bolting on a penalty: make the model's own confidence part of the reward. The note "Can model confidence work as a reward signal for reasoning?" describes RLSF, which ranks reasoning traces by the model's answer-span confidence, building synthetic preferences that both strengthen step-by-step reasoning *and* reverse the calibration damage that ordinary RLHF inflicts, without human labels or external verifiers. So calibration isn't a tax you pay against capability; with the right signal the two move in the same direction.
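The ranking step can be sketched as follows. This is not RLSF's actual implementation: the confidence measure (mean per-token probability over the answer span), the data layout, and the function names are all assumptions made for illustration.

```python
import math
from itertools import combinations


def span_confidence(answer_token_logprobs):
    """Mean per-token probability over the answer span (an assumed confidence measure)."""
    return sum(math.exp(lp) for lp in answer_token_logprobs) / len(answer_token_logprobs)


def build_preferences(traces):
    """Turn sampled traces into (chosen, rejected) pairs ordered by answer-span confidence.

    traces: list of (trace_text, answer_token_logprobs) tuples.
    """
    scored = sorted(
        ((span_confidence(logprobs), text) for text, logprobs in traces),
        reverse=True,
    )
    return [
        (high_text, low_text)
        for (high_conf, high_text), (low_conf, low_text) in combinations(scored, 2)
        if high_conf > low_conf  # skip ties: no preference signal
    ]


# Hypothetical traces for one prompt, each with log-probs of its answer tokens.
sampled = [("trace a", [-0.1]), ("trace b", [-0.7]), ("trace c", [-0.4])]
print(build_preferences(sampled))
```

The resulting pairs could then feed any preference-based trainer; no human label or external verifier appears anywhere in the loop, which is the property the note highlights.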

Worth knowing as you go deeper: the *shape* of the reward matters as much as its content. The note "Does negative reinforcement alone outperform full reinforcement learning?" finds that training only on wrong answers (suppressing them) preserves output diversity, whereas positive-only reinforcement concentrates probability mass and degrades performance at higher k (Pass@k with many samples); that preserved spread is exactly what honest uncertainty looks like. Meanwhile, a whole line of work argues that richer reward signals beat thin binary ones in general: "Can judges that reason about reasoning outperform classifier rewards?" and "Can reward models benefit from reasoning before scoring?" show that rewards which *reason* about the work, rather than just stamp it correct or incorrect, raise the ceiling on what the model can learn.
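Why concentrated probability mass hurts at higher k can be seen with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the cited work uses exactly this estimator is an assumption, and the per-problem counts below are invented for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical numbers: two problems, 10 samples each.
concentrated = [10, 0]  # always right on one problem, never on the other
diverse = [3, 3]        # sometimes right on both

for k in (1, 5):
    print(k, mean_pass_at_k(concentrated, 10, k), mean_pass_at_k(diverse, 10, k))
```

In this toy setup the concentrated policy leads at k=1 (0.5 versus about 0.3) but trails at k=5 (0.5 versus about 0.92), because extra samples only help a policy that still spreads probability over alternative answers.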

One caution before you trust any calibration number: the note "Does setting temperature to zero actually make LLM outputs reliable?" reminds us that a confident, repeatable output is still just one draw from a distribution. Consistency is not the same as being right, and a model can be reliably miscalibrated. So the honest answer to the question is: log-likelihood plus binary rewards *can't* get you calibration by themselves, but log-likelihood plus a properly scored or confidence-aware reward can.
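"Reliably miscalibrated" can be made measurable with expected calibration error (ECE), a common binned metric. The sketch below is a plain equal-width-bin version; the bin count and the example data are assumptions, and none of the source notes prescribe this particular metric.

```python
def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and accuracy across equal-width bins."""
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_right in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, is_right))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, is_right in members if is_right) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical model: perfectly consistent (always says 0.95) yet right only 6 times in 10.
outcomes = [True] * 6 + [False] * 4
print(expected_calibration_error([0.95] * 10, outcomes))  # large gap
print(expected_calibration_error([0.60] * 10, outcomes))  # gap near zero
```

The first model never varies its stated confidence, so it looks stable on repetition, but its gap of roughly 0.35 between confidence and accuracy shows that stability and calibration are separate properties.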

Sources 7 notes

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.


Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

RM-R1: Reward Modeling as Reasoning (2.65 match)
Understanding and Mitigating Premature Confidence for Better LLM Reasoning (2.61 match)
Reward Reasoning Model (2.59 match)
StepWiser: Stepwise Generative Judges for Wiser Reasoning (1.78 match)
Reasoning Language Models: A Blueprint (1.75 match)
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning (1.74 match)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (1.72 match)
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty (1.71 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a calibration researcher re-testing constraints on reward design in LLM training. The question: can log-likelihood loss + binary rewards achieve calibration?

What a curated library found — and when (dated claims, not current truth):
Library findings span 2024–2025. Key claims:
- Binary rewards provably degrade calibration by ignoring confidence, pushing models toward high-confidence guessing (2024–2025).
- Proper scoring rules (Brier score) or confidence-aware rewards mathematically remove the accuracy–calibration tradeoff (2024–2025).
- Negative reinforcement alone (suppressing wrong answers) preserves output diversity better than positive-only RL, which concentrates probability mass (2025).
- Reward models that reason step-by-step outperform binary classifiers; reasoning-based rewards extend test-time compute scaling (2025).
- Deterministic outputs ≠ calibration; consistency is not reliability (2024).

Anchor papers (verify; mind their dates):
- arXiv:2506.01347 (2025-06): The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- arXiv:2505.14674 (2025-05): Reward Reasoning Model
- arXiv:2508.19229 (2025-08): StepWiser: Stepwise Generative Judges for Wiser Reasoning
- arXiv:2511.07699 (2025-11): Misaligned by Design: Incentive Failures in Machine Learning

Your task:
(1) RE-TEST EACH CONSTRAINT. For binary rewards degrading calibration: do newer post-training schemes (e.g., DPO, IPO, or confidence-conditioned rewards from 2025-onward models) empirically relax this? Does reasoning-based reward modeling (2025) change the regime? Separate the durable insight (reward shape matters) from the perishable limitation (binary rewards are always bad).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers claiming binary rewards *do* preserve calibration under certain settings, or showing proper scoring rules don't scale, or arguing diversity loss is a worse problem than calibration.
(3) Propose 2 research questions assuming the regime has moved: (a) Can confidence-in-process (mid-reasoning) signals outperform answer-span confidence as a reward anchor? (b) Does calibration improve *monotonically* with reasoning-based rewards, or do richer signals hit their own saturation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
