SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can model confidence work as a reward signal for reasoning?

Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Reinforcement Learning from Self-Feedback (RLSF) exploits a simple observation: in a well-calibrated model, answer confidence correlates with reasoning quality. By using confidence as the reward signal rather than human preference or external verification, RLSF achieves two things simultaneously that normally trade off:

(i) Restores calibration — confidence becomes predictive of correctness again, after RLHF had degraded it. RLHF optimizes for human preference and fluency, which rewards confident-sounding outputs regardless of accuracy. RLSF reverses this by making the reward explicitly tied to calibrated confidence.

(ii) Strengthens step-by-step reasoning — higher-confidence answer spans tend to come from traces with more coherent reasoning chains. Training to maximize confidence indirectly selects for better reasoning.
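To make "confidence becomes predictive of correctness" measurable, one common diagnostic is expected calibration error (ECE). The sketch below is illustrative only: the note does not say which calibration metric RLSF reports, and the function name, the equal-width binning, and the default of 10 bins are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-weighted mean gap between average confidence and accuracy.

    confidences: floats in [0, 1], one per answer.
    correct: booleans, True where the answer was right.
    NOTE: equal-width bins and n_bins=10 are assumptions for illustration.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp the index so conf == 1.0 falls in the last bin.
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.9 but is right three times out of four has a gap of about 0.15 in that bin; a model whose stated confidence matches its hit rate scores close to zero.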

The mechanism: a frozen LLM generates multiple CoT solutions for each problem. Confidence is computed per final-answer span. Traces are ranked by this confidence to create a synthetic preference dataset (higher confidence = chosen, lower = rejected). A reward model is trained on these preferences and used for standard RL finetuning.
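The ranking step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Trace` type, the use of the geometric-mean token probability as the span confidence, the all-pairs construction, and the `min_gap` filter are all assumptions introduced here, since the note only says that confidence is computed per final-answer span and used to rank traces.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Trace:
    """One sampled chain-of-thought solution (hypothetical container)."""
    text: str
    answer_logprobs: list  # log-probabilities of the final-answer span tokens


def span_confidence(logprobs):
    # ASSUMPTION: confidence = geometric mean of the answer-token probabilities.
    return math.exp(sum(logprobs) / len(logprobs))


def build_preference_pairs(prompt, traces, min_gap=0.0):
    """Rank traces by answer-span confidence and emit (chosen, rejected) pairs.

    min_gap is a hypothetical filter that drops near-ties.
    """
    ranked = sorted(
        traces, key=lambda t: span_confidence(t.answer_logprobs), reverse=True
    )
    pairs = []
    # combinations() preserves order, so `hi` always ranks at or above `lo`.
    for hi, lo in combinations(ranked, 2):
        gap = span_confidence(hi.answer_logprobs) - span_confidence(lo.answer_logprobs)
        if gap > min_gap:
            pairs.append({"prompt": prompt, "chosen": hi.text, "rejected": lo.text})
    return pairs
```

The resulting list has the shape that preference-based reward-model training typically expects (prompt, chosen, rejected); the reward model and the RL step themselves are outside this sketch.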

The key insight is that confidence-as-reward can be inserted as an additional post-training step after standard SFT and RLHF — patching the calibration damage that RLHF introduces without undoing its alignment benefits. This requires no human labels, gold answers, or externally curated rewards.

The human learning parallel is explicit: humans use confidence as an intrinsic reward signal when external feedback is unavailable. Metacognitive monitoring — the ability to track your own certainty — is how humans regulate their own learning without a teacher.

The connection to "Does binary reward training hurt model calibration?" is complementary: that work adds calibration as an explicit second reward term; RLSF uses calibration itself as the primary reward. Both address the same RLHF-induced calibration degradation from different angles.

The risk is the same as in "Does self-consistency reliably reward correct answers during training?": confidence and self-consistency are correlated proxies, both vulnerable to the model becoming confidently wrong. But RLSF's emphasis on calibration (making confidence track accuracy) is explicitly designed to resist this, because the model is rewarded for being accurately confident, not just confident.

Extensions to general domains via RLPR and INTUITOR: Two RLVR papers extend intrinsic reward signals beyond math to general domains. RLPR (Reinforcement Learning with Reference Probability Reward) computes the model's token-level probability of generating a reference answer and uses it as the reward signal, so the model's own knowledge about what constitutes a correct answer replaces external verifiers. INTUITOR goes further: it uses self-certainty as the sole reward signal, a measure of how far the model's next-token distributions sit from uniform, averaged over the generated tokens. Both extend verifiable-reward RL to domains without rule-based verifiers (medicine, law, open-ended reasoning), precisely the domains where external verification infrastructure is hardest to build.

The convergence with RLSF is notable: all three use the model's internal probability landscape as reward, but RLSF targets calibration restoration, RLPR targets domain extension, and INTUITOR targets complete verifier independence. See "Can model confidence alone replace external answer verification?".
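The two intrinsic rewards can be contrasted in a few lines. This is a hedged sketch, not either paper's code: the function names are hypothetical, averaging per-token probabilities is one plausible aggregation for an RLPR-style reward, and the self-certainty function assumes the form of an average KL divergence from the uniform distribution to the model's next-token distribution, which is how the measure is commonly defined. Inputs are plain probability lists rather than real model outputs.

```python
import math


def reference_probability_reward(ref_token_probs):
    """RLPR-style reward (sketch): mean probability the model assigns to
    each token of the reference answer.

    ASSUMPTION: mean aggregation; other aggregations are possible.
    """
    return sum(ref_token_probs) / len(ref_token_probs)


def self_certainty(next_token_dists):
    """Self-certainty (sketch): average KL(U || p) over generated positions,
    where U is uniform over the vocabulary and p is the model's next-token
    distribution. Needs no reference answer.

    ASSUMPTION: every probability is strictly positive (as after a softmax).
    """
    total = 0.0
    for dist in next_token_dists:
        vocab = len(dist)
        uniform = 1.0 / vocab
        total += sum(uniform * math.log(uniform / p) for p in dist)
    return total / len(next_token_dists)
```

A uniform next-token distribution gives a self-certainty of zero, and the value grows as the distribution becomes more peaked. The first function needs a reference answer to score, while the second needs only the model's own distributions, which is the distinction between the two approaches drawn above.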

Original note title

model confidence as intrinsic reward simultaneously restores calibration and improves reasoning — unlike RLHF which optimizes preference at the cost of calibration