Does binary reward training hurt model calibration?
Explores whether the standard correctness-based reward in RL training creates incentives for overconfident predictions, and what structural problem causes calibration to degrade during optimization.
Binary correctness reward is the dominant approach to RL training for reasoning: a correct answer earns 1, an incorrect answer earns 0. RLCR (Reinforcement Learning with Calibration Rewards) identifies its structural flaw: a binary reward is blind to expressed confidence. A wrong answer given with 99% confidence is penalized no more than a wrong answer given with 51% confidence, and a correct answer earns the same reward of 1 at either confidence level. The model therefore has no incentive to match its expressed confidence to its actual accuracy; high-confidence guessing is rational if it succeeds often enough.
The consequence is calibration degradation: models become more confident over the training run but not proportionally more accurate. On out-of-domain problems, where accuracy doesn't keep pace with confidence, the degradation produces higher rates of confident incorrect answers — what the paper frames as increased hallucination frequency.
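To make "more confident but not proportionally more accurate" concrete, here is a minimal sketch of expected calibration error (ECE), one common way to quantify the gap between stated confidence and accuracy. The note does not say which calibration metric the paper reports, so the choice of ECE, the function name, and the toy numbers are all assumptions for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for q, c in zip(confidences, correct):
        idx = min(int(q * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[idx].append((q, c))

    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(q for q, _ in b) / len(b)
        accuracy = sum(c for _, c in b) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


# Hypothetical data: accuracy stays at 60% while stated confidence rises.
correct = [1, 1, 1, 0, 0] * 2
ece_before = expected_calibration_error([0.60] * 10, correct)
ece_after = expected_calibration_error([0.95] * 10, correct)
```

In this toy case the calibration error grows from roughly zero to roughly 0.35 even though accuracy is unchanged, which is the pattern the paragraph above describes.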
The mathematical fix: add a Brier-score term as a second reward term alongside binary correctness. The Brier score is a strictly proper scoring rule: in expectation it is uniquely optimized when the predicted probability matches the true outcome probability. As a loss it is minimized, so it enters the reward with a negative sign. The composite RLCR reward is therefore provably maximized only when the model (1) outputs the answer most likely to be correct AND (2) expresses a calibrated confidence estimate. The proof holds for any bounded proper scoring rule as the calibration term.
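A minimal sketch of such a composite reward, assuming the two terms are simply added with equal weight (the note does not state the weighting, and the function names here are illustrative). The grid search checks numerically that, for an answer that is correct with probability p, the expected reward peaks when the stated confidence q equals p.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Binary correctness minus the Brier penalty (equal weighting is an assumption)."""
    y = 1.0 if is_correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward for an answer that is right with probability p_correct."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))


# Search stated confidences on a 0.01 grid for an answer that is right 70% of the time.
grid = [i / 100 for i in range(101)]
p = 0.7
best_q = max(grid, key=lambda q: expected_reward(p, q))
calibrated_value = expected_reward(p, p)
```

Under this form the expected reward at calibrated confidence works out to p squared, so it also increases with p: picking a more likely answer and reporting it honestly both pay off.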
A surprising negative result: the log-likelihood loss (log score), also a proper scoring rule, does NOT have this property when combined with binary correctness reward; it can incentivize incorrect answers under specific confidence profiles. Unlike the Brier score, the log score is unbounded, and the boundedness of the Brier score is what enables the joint optimization guarantee.
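One confidence profile where this can happen, constructed here for illustration (the note does not say which profiles the paper analyzes): the model's best answer is right only half the time. With a log-score term, honestly reporting 50% confidence has negative expected reward, while submitting a known-wrong answer with near-zero confidence earns roughly zero. The function names and the additive form are assumptions.

```python
import math


def expected_reward_log(p: float, q: float) -> float:
    """Correctness plus log score, for an answer right with probability p and stated confidence q in (0, 1)."""
    return p * (1.0 + math.log(q)) + (1.0 - p) * math.log(1.0 - q)


def expected_reward_brier(p: float, q: float) -> float:
    """Correctness minus Brier penalty, same setup."""
    return p * (1.0 - (1.0 - q) ** 2) + (1.0 - p) * (0.0 - q ** 2)


# Honest strategy: give the best answer (right 50% of the time), state 50% confidence.
honest_log = expected_reward_log(0.5, 0.5)
honest_brier = expected_reward_brier(0.5, 0.5)

# Degenerate strategy: give an answer known to be wrong, state near-zero confidence.
wrong_log = expected_reward_log(0.0, 1e-9)
wrong_brier = expected_reward_brier(0.0, 1e-9)
```

Here the log-score variant prefers the deliberately wrong answer, while the Brier variant still prefers the honest one.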
Empirically: across diverse datasets, RLCR substantially improves calibration on both in-domain and out-of-domain evaluations with no accuracy cost. Standard RL hurts calibration; RLCR improves it.
The RLSF (RL with Self-Confidence) framework provides a complementary approach: instead of adding an external calibration term, it uses the model's own verbalized confidence as an intrinsic reward signal. The model generates a confidence estimate alongside its answer, and the reward combines correctness with confidence calibration. This is architecturally simpler than RLCR's Brier score approach but relies on the model's ability to self-assess, which the note "Does reflection in reasoning models actually correct errors?" suggests may be unreliable. RLCR's mathematical guarantee may be more robust than RLSF's empirical approach.
Two complementary robustness approaches from the reward hacking literature:
Bayesian Reward Model Ensembles (BRME) — train a multi-head reward model where each head outputs mean and standard deviation of a Gaussian. The head with lowest standard deviation provides the nominal reward (highest confidence). The ensemble characterizes an uncertainty set of reward functions, enabling a composite objective that balances nominal performance with worst-case robustness. This addresses calibration from the reward model side rather than the reward function side.
Contrastive Rewards — compute baseline responses offline, then use the reward difference between online-generated and baseline responses as a penalty term in PPO. This calibrates the RL process by making rewards relative rather than absolute, penalizing reward uncertainty and calibrating according to task difficulty. The contrastive signal provides implicit comparative information that absolute rewards lack.
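The two ideas above can be sketched in a few lines. This is a simplified illustration, not the published formulations: taking the worst case as the minimum head mean, mixing nominal and worst-case terms with a single weight `lam`, and averaging over several baseline rewards are all assumptions, as are the function names.

```python
from statistics import mean


def brme_reward(heads, lam=0.5):
    """heads: list of (mean, std) pairs, one per reward-model head.

    Nominal reward comes from the most confident head (lowest std).
    The worst case is approximated by the lowest head mean (an assumption).
    """
    nominal = min(heads, key=lambda h: h[1])[0]
    worst_case = min(m for m, _ in heads)
    return lam * nominal + (1.0 - lam) * worst_case


def contrastive_reward(online_reward, baseline_rewards):
    """Reward of the online response relative to offline baseline responses for the same prompt."""
    return online_reward - mean(baseline_rewards)


# Hypothetical head outputs and baseline rewards for a single prompt.
robust = brme_reward([(0.8, 0.3), (0.6, 0.1), (0.2, 0.5)], lam=0.5)
relative = contrastive_reward(0.9, [0.7, 0.5])
```

In the contrastive case an easy prompt, where baselines already score high, yields a small relative reward, which is one way to read "calibrating according to task difficulty".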
Both approaches are complementary to RLCR: BRME addresses reward model uncertainty, contrastive rewards address reward signal relativity, and RLCR addresses the fundamental incentive structure of binary rewards. Together they suggest calibration degradation has multiple attack surfaces — no single fix addresses all of them.
Connects to the note "Does reasoning fine-tuning make models worse at declining to answer?": both identify the calibration cost of reasoning training. RLCR reframes this as a reward design failure rather than an inherent trade-off: the degradation is a property of binary-only reward, not of reasoning training as such.
Inquiring lines that use this note as a source (207)
This note is a source for these synthesized inquiries.
- Does learning community preferences as training rewards operationalize prediction without participation?
- When does statistical dominance in training create deployment failure patterns?
- How does simulator goal drift compound agent intent alignment failures during training?
- Do safety benchmarks miss the effects of warmth training on model reliability?
- Can standard accuracy metrics miss the real constraints on user consumption?
- Why does binary reward forcing degrade model calibration?
- Can checklist-based rewards fix judgment problems in RL training?
- What makes some tasks bounded enough for reliable RL?
- What behavioral changes occur during reward learning training?
- Why do spurious reward signals improve reasoning for some pretrained models?
- Why does online RL succeed where supervised training fails for self-correction?
- How does distribution mismatch between training and deployment break self-correction?
- How does RLHF reward structure incentivize agreement over accuracy?
- How do AI errors in norm prediction differ from systematic human errors?
- Does in-distribution reward model performance hide failures from context shift?
- Can outcome-based rewards fully replace per-step likelihood in diffusion RL training?
- Why do reward models trained for accuracy ignore important context about the input?
- Can separating accuracy and calibration objectives improve both simultaneously?
- Does majority voting reliably signal correctness without risking reward hacking?
- What happens when a single loss function conflates representation learning with decision-making?
- Why does asymmetric self-play create naturally calibrated difficulty better than fixed curricula?
- Why do error avalanches accelerate in self-training loops without verification?
- Can log-likelihood loss combined with binary rewards achieve calibration?
- How do models generalize specific training exploits into broad misaligned objectives?
- How do reward model ensembles improve robustness to miscalibration?
- Does correct model behavior guarantee internal alignment of learned objectives?
- Why do standard accuracy metrics ignore set-level consumption constraints?
- How does prompt context decomposition reveal hidden reward model failures?
- Why do reward models learn surface-level shortcuts instead of genuine quality assessment?
- Can reward engineering and information-theoretic architecture solve partner-awareness separately?
- Can synthetic self-play data teach models when to disagree?
- Does layer-wise prediction stabilization provide a stronger trace quality signal than confidence alone?
- Why does model confidence correlate with robustness to prompt variations?
- How do different training objectives shift whether models over-predict or under-predict?
- How do training data cutoffs produce false claims that stay consistent?
- Why do zero-advantage rollouts destabilize training beyond just wasting compute?
- Can reward model training be automated without changing feedback mechanisms?
- What breaks when you apply reinforcement learning after supervised fine-tuning?
- How does uncertainty estimation drive computational resource allocation in models?
- Can bidirectional model updating between humans and AI reduce misalignment?
- Can scaling predictions become reliable if improvements are continuous not sudden?
- Do outcome-only reward signals miss step-level errors that compound later?
- Can in-context learning replicate the timing effects that RL teaches models?
- Does policy entropy collapse represent the main bottleneck in reasoning-focused RL scaling?
- How does reward function accuracy affect the efficiency of test-time compute allocation?
- How do surface statistical regularities enable correct outputs while degrading robustness?
- Can the serving loop itself become the primary training data source?
- Can online RL and trainable agents maintain persona consistency better than fixed environments?
- Why do models fail under distribution shift if accuracy metrics stay high?
- Can meta-reinforcement learning explain why this bias pattern emerges rationally?
- Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
- What stability techniques prevent collapse in policy-critic adversarial training?
- How does modularity in reward and policy design enable goal generalization?
- Does optimizing for model confidence actually improve both performance and calibration simultaneously?
- How do probability-based rewards compare to self-consistency as training signals for reasoning?
- What happens when confident wrong answers become more rewarded than uncertain correct ones?
- At what capability level does the generation-verification gap make intrinsic rewards insufficient?
- How does reward model training permit spurious correlations in scoring?
- Can counterfactual invariance eliminate presentation-based hacking of reward models?
- Can uncertainty estimates based on model self-assessment reliably signal errors?
- Why do improvements in accuracy come at the cost of calibration?
- How should researchers evaluate whether correct model outputs reflect real structural learning?
- Does model confidence actually correlate with robustness against prompt variations?
- Why does RLHF degrade model calibration despite improving preference alignment?
- What distinguishes verifiable rewards from preference-based rewards in unified training?
- How does implicit feedback structure differ from explicit ratings mathematically?
- Why does probability competition between predictions improve top-N ranking?
- What makes top-N ranking loss difficult to optimize directly?
- How does negative reinforcement redistribute probability without guiding toward correct answers?
- Can suppressing incorrect behavior alone solve the diversity bottleneck in reasoning RL?
- Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?
- How do residual connections and layer norm stabilize training in deep RL?
- Can architectural changes like decoupling intent understanding help overcome next-turn reward limitations?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- Can self-supervised methods replace human annotations for process reward models?
- Can UCB-style bonuses over outcome space prevent policy entropy collapse?
- How do Q-value models improve action selection compared to value models?
- Can diversity-aware RL objectives prevent format convergence?
- Can model confidence signals replace explicit external reward functions?
- Can a model predict the right action but execute the wrong one?
- How do loss functions simultaneously shape both learning and decision quality?
- What makes utility-weighted training backfire in machine learning systems?
- Can counterfactual data augmentation fully eliminate preference model miscalibration?
- How do reward model biases cascade into downstream optimization failures?
- Can RL with verifiable rewards improve dialogue quality better than preference optimization?
- What information-theoretic framework explains why process rewards beat outcome only?
- Why do production teams choose expensive frontier models over fine-tuning?
- Does high model confidence increase the risk of human overreliance?
- Can alignment methods model loss aversion without creating unintended sophistry?
- Why does optimizing only quality cause model collapse in self-improvement loops?
- Can safety training and reasoning training be combined without losing calibration?
- How do inference-time reward methods compare to per-user fine-tuning?
- Does attention bias in transformers compound with training-level reward insensitivity?
- Can we distinguish between genuine alignment and response quality bias in reward signals?
- How does prompt insensitivity in reward models enable adversarial attacks on judges?
- What distinguishes generative reward models from outcome-based and process-based approaches?
- Can smaller models achieve domain expertise through focused RL training?
- Can negative reinforcement alone match full RL performance on domain tasks?
- What makes software engineering environments better suited for RL than other interactive domains?
- Which recipe choices determine the asymptotic ceiling in RL training?
- Can algorithm choice like PPO substitute for recipe-level design decisions?
- What happens to model reasoning when policy entropy collapses during RL?
- Can RL format selection explain performance gains attributed to algorithmic improvements?
- How does next-turn reward optimization contribute to agent passivity?
- Can structured natural language feedback outperform scalar rewards in RL?
- Does RLVR reward structure create pressure toward traces that look right?
- Can proper scoring rules fix RLVR's degradation on disagreement prediction?
- Why do spurious rewards work nearly as well as correct ones?
- How can training detect the onset of reward hacking on self-consistency?
- Can semantic entropy improve model calibration without external ground truth?
- Why does weight space search reduce robustness to prompt perturbations better than prompt engineering?
- Can proper scoring rules restore model calibration without sacrificing accuracy?
- Can intrinsic confidence signals improve both calibration and reasoning performance?
- What makes Effective Rank Acceleration a stable training signal for dual-channel incentives?
- How do counterfactual invariance approaches prevent reward hacking in practice?
- Can safety benchmarks detect reliability degradation from warmth training?
- How do RL training and base models differ in creating MI peaks?
- Does format-based pretraining determine how models respond to reinforcement learning?
- Can reward design fix the conflict between reasoning accuracy and abstention calibration?
- How does the Assistant Axis explain why warmth training degrades accuracy?
- Can trajectory quality filtering improve model training in noisy environments?
- What deployment modes work best for trajectory-aware reward signals?
- Does environment stochasticity force models to generalize better across trajectory variations?
- What role does real-time accuracy feedback play in reducing user overreliance?
- Why do different model training approaches produce different overthinking thresholds?
- What happens when post-training patches try to add human values without upstream pipeline change?
- How should humans specify deterministic abstractions of RL problems?
- Can one training example activate mathematical reasoning in RL-trained models?
- When does outcome reward signal become informative during model training?
- Can tool-call advantage attribution distinguish between correct and incorrect calls in mixed trajectories?
- Why do sparse outcome rewards fail to credit correct tool calls in failed trajectories?
- Can model training address failures that really originate in harness gaps?
- Can process supervision improve agentic RL through meta-reasoning rewards?
- Can log-probability ratios resist reward hacking better than learned PRM signals?
- Why does belief-shift reward enable smaller models to match larger baselines?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- Can binary judge feedback replace external reward signals for skill learning?
- How do reward signals in RLVR interact with pretraining biases?
- How do checklists prevent reward models from exploiting superficial response artifacts?
- How should skill libraries coordinate with gradient-based weight optimization?
- Why do reward models fail to recognize genuinely different valid answers?
- How do difficulty metrics relate to the true value of training examples?
- How does 93% reward reliability compare to other RL noise sources?
- How do verifier-free and adversarial approaches compare in extending reasoning RL?
- How do reward models as policy discriminators differ from labeled preferences?
- What scaling properties emerge from RL training dynamics beyond verification?
- Can dynamic variance weighting replace fixed objective combination weights?
- Why does scalarization of rewards fail for multi-objective GRPO training?
- What happens when variance in reward signals comes from a noisy model?
- Can vector-valued rewards preserve specialization better than variance-weighted advantages?
- How does on-policy entropy recognition differ from training-time entropy collapse?
- How does absolute-advantage weighting concentrate training on boundary cases?
- Why do single-turn RL methods fail to generalize to multi-turn tasks?
- What explicit safeguards should limit personalization in deployed reward models?
- Does model uncertainty overwhelm persona-specific signal in conditioned predictions?
- How should multi-objective post-training balance competing behavioral goals?
- Why do overtrained domains show different RL training outcomes than novel tasks?
- How do process reward models compare to token-level variance filtering?
- What other downstream metrics could serve as RL reward sources?
- Can verifiable rewards during pretraining replace costly human preference labeling?
- Can verifier-free RL work without manual preference labels or task-specific training?
- How do relational reward signals compare to absolute preference encodings in RL?
- Can teachers trained under uncertainty constraints distill better generalizing students?
- Are different reward signal sources substitutable in verifier-free RL?
- Does the pretrained model prior limit RL search capability more than the optimization algorithm itself?
- Why do majority-vote rewards amplify errors below an accuracy threshold?
- What patterns of reward hacking can offline rollout analysis reliably detect and prevent?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Why do rubric scores amplify reward hacking when converted to dense gradients?
- Can structured rewards still teach models when spurious rewards also work?
- Does careful reward engineering matter if pretraining determines RLVR effectiveness?
- How do miscalibrated confidence signals affect the success of SmartPause routing?
- Can held-out validation gates prevent optimizer hallucinations in skill proposals?
- What training regimes confound surface mechanisms with their actual causes?
- Does RL training redirect self-doubt into productive gap analysis?
- How does advantage normalization improve critic-free policy learning?
- Can tree-GRPO work with extremely noisy or sparse outcome reward signals?
- Why does reinforcement learning training degrade model calibration?
- What are the actual limits of sibling comparison versus trained process reward models?
- What makes consensus games work without retraining the base model?
- What makes reward models fundamentally different from policy discriminators?
- What makes binary rewards more effective than richer reward signals?
- Why does outcome-based RL specifically lose diversity during training?
- When does a task lack a meaningful multi-dimensional reward structure?
- How does belief-shift credit assignment compare to process reward models?
- What alignment properties emerge when the reward model disappears?
- Does pairwise self-judgment avoid reward model scaling problems?
- Why do model-based verifiers introduce reward hacking and compute overhead?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- What makes reward signal sources substitutable across verifier-free RL patterns?
- How do binary comparisons constrain reward scale in multi-user preference learning?
- What makes a task at the edge of competence optimal for RL?
- What makes trajectory quality matter more than one-shot task success?
- How do open-world evaluations correct distortions that automated benchmarks introduce?
- Can a single Elo ranking represent multidimensional model capability?
- What makes uncertainty calibration harder than expanding knowledge?
- What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?
- What makes user-decision rewards better than model-confidence rewards?
- Can open-world evaluations become a scalable paradigm without becoming the next benchmark trap?
- How does positive-only rubric scoring prevent models from gaming intermediate steps?
- Can trajectory structure replace hand-annotated process reward models entirely?
- Can models trained with RL on pretraining data avoid reward hacking seen in RLHF?
- Can production RL systems escalate from gaming to emergent misalignment behaviors?
- Do frontier models develop strategic misalignment from ordinary training pressure alone?
- How much performance is lost when converting pretrained checkpoints versus training from scratch?
- How does process-based reward differ from outcome-only reward in training?
- What makes advantage shaping more stable than reward shaping for tool training?
Related concepts in this collection (9)
- Does reasoning fine-tuning make models worse at declining to answer?
When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
RLCR reframes: calibration degradation is a binary-reward design choice, not an inherent trade-off; it is fixable with a proper scoring rule
- Does step-level confidence outperform global averaging for trace filtering?
Explores whether measuring confidence at individual reasoning steps—rather than averaging across entire traces—better identifies and filters out low-quality reasoning. Matters because it could dramatically improve both accuracy and compute efficiency in multi-trace reasoning.
RLCR produces the calibrated model that makes verbalized-confidence scaling meaningful at test time
- Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
mechanistic link: entropy collapse and calibration degradation are two faces of the same RL dynamic; as the policy concentrates probability mass (entropy collapses), expressed confidence increases without matching accuracy gains
- Why do reasoning models fail at predicting disagreement?
RLVR models optimize for single correct answers, but many real tasks involve legitimate disagreement among annotators. Does this optimization fundamentally suppress the model's ability to capture when humans reasonably disagree?
binary reward's calibration failure extends to disagreement prediction: correct/incorrect cannot represent variance distributions; RLVR models lose sensitivity to legitimate annotation spread
- Why do accurate predictions lead to poor decisions?
Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.
calibration degradation is a specific instance of the prediction-decision gap: binary reward optimizes for correct answers (decision quality) while degrading probability estimates (prediction quality); RLCR's composite reward explicitly separates these two objectives within a single reward function
- Can utility-weighted training loss actually harm model performance?
When engineers weight loss functions to reflect real-world costs of different errors, does this improve or undermine learning? This explores whether baking asymmetric objectives into training creates unintended side effects.
general principle: binary reward is a special case of the learning/choosing conflation; it correctly incentivizes choosing the right answer but fails to incentivize learning calibrated uncertainty; the RLCR fix (add Brier score) operationalizes the "separate learning from choosing" prescription in the RL reward design space
- Can model confidence work as a reward signal for reasoning?
Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
complementary approach: RLCR adds an external calibration term (Brier score) to the reward; RLSF uses the model's own confidence as the reward signal; RLCR has a mathematical guarantee while RLSF is architecturally simpler but relies on self-assessment quality
- Can simple uncertainty estimates beat complex adaptive retrieval?
Does measuring a language model's own confidence on token probabilities outperform expensive multi-call adaptive retrieval pipelines? This matters because it could simplify RAG systems while reducing computational overhead.
calibration quality is the upstream prerequisite for uncertainty-triggered retrieval: FLARE and similar systems rely on token-probability confidence as a reliable signal; binary RL training degrades exactly this calibration, undermining the assumption that low-probability tokens reliably signal knowledge gaps
- Can we detect when language models confabulate?
Current uncertainty metrics fail to catch inconsistent outputs that look confident. Could measuring semantic divergence across samples reveal confabulation signals that token-level metrics miss?
semantic entropy provides an alternative uncertainty signal that operates at the meaning level rather than the token level; well-calibrated models (the output of RLCR) should have lower semantic entropy on questions they answer correctly, creating a testable link between calibration quality and confabulation detection
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reward-Robust RLHF in LLMs
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- Reinforcement Learning with Rubric Anchors
- Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
- Can Large Reasoning Models Self-Train?
- A Survey on Post-training of Large Language Models
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- Reward Reasoning Model
Original note title
binary reward rl provably degrades calibration — adding a proper scoring rule as a second reward term jointly optimizes accuracy and calibration without trade-off