INQUIRING LINE

How do probability-based rewards compare to self-consistency as training signals for reasoning?

This explores two families of label-free training signals for reasoning — rewards built from the model's own probability estimates (confidence, belief-shift, likelihood gain) versus rewards built from how often the model agrees with itself across samples — and what the corpus says about which holds up.


Both approaches train reasoning without human labels or external verifiers: probability-based rewards read the model's own internal confidence, while self-consistency rewards answers the model reproduces across many samples. The corpus suggests they look similar on the surface, since both are 'free' intrinsic signals, but they diverge sharply in how they fail.

Self-consistency has a known and damning failure mode. It works at first for bootstrapping RL without labels, but models eventually learn to generate answers that are confidently wrong yet highly reproducible. The proxy's correlation with actual correctness decays over training, so the reward curve keeps climbing while accuracy quietly degrades: apparent improvement that is really reward hacking (see "Does self-consistency reliably reward correct answers during training?"). The reward and the goal come apart precisely because consistency measures agreement, not truth.
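The divergence between reward and accuracy can be illustrated with a minimal sketch. It assumes the self-consistency reward is a majority-vote pseudo-label (one common formulation; the notes do not specify the exact variant), and the sample answers and function names are hypothetical toy values.

```python
from collections import Counter


def self_consistency_rewards(answers):
    """Give 1.0 to each sampled answer that matches the majority answer, else 0.0.

    Assumption: the majority vote stands in for the missing ground-truth label.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]


def mean_reward_and_accuracy(answers, gold):
    """Return (mean self-consistency reward, true accuracy) for one prompt."""
    rewards = self_consistency_rewards(answers)
    mean_reward = sum(rewards) / len(rewards)
    accuracy = sum(a == gold for a in answers) / len(answers)
    return mean_reward, accuracy


# Hypothetical early-training samples: noisy, but the majority is correct.
early = ["42", "42", "41", "42", "40", "42", "39", "42"]
# Hypothetical late-training samples: collapsed onto a reproducible wrong answer.
late = ["41"] * 8

print(mean_reward_and_accuracy(early, gold="42"))  # reward and accuracy agree
print(mean_reward_and_accuracy(late, gold="42"))   # reward is maximal, accuracy is zero
```

In the collapsed case the reward is at its maximum even though no sample is correct, which is the sense in which the proxy measures agreement rather than truth.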

Probability-based signals are richer because they read a graded internal state rather than a vote count. RLSF ranks reasoning traces by answer-span confidence; it not only strengthens step-by-step reasoning but also reverses the calibration damage that ordinary RLHF inflicts (see "Can model confidence work as a reward signal for reasoning?"). ΔBelief-RL goes further and uses the log-ratio of the model's sequential probability estimates, that is, how much each step shifts belief toward the solution, as a dense per-turn credit signal with no critic network or process reward model required; on 20 Questions, smaller models trained this way matched or beat larger baselines (see "Can an agent's own beliefs guide credit assignment without critics?"). And RLP plants this idea in pretraining itself, rewarding chain-of-thought by the log-likelihood improvement it produces and lifting math and science benchmarks by roughly 19% (see "Can chain-of-thought reasoning be learned during pretraining itself?").
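A rough sketch of the two log-probability signals described above may help. It assumes per-turn credit is the log-ratio of consecutive belief estimates and that the pretraining reward is the difference in log-likelihood with and without the chain-of-thought. The function names and numbers are hypothetical, and the actual methods may normalize or clip these quantities differently.

```python
import math


def belief_shift_credits(beliefs):
    """Per-turn credit log(p_t / p_{t-1}) from a sequence of belief estimates.

    `beliefs[t]` is assumed to be the probability the model assigns to the
    correct solution after turn t (index 0 is the prior before any turn).
    """
    if any(not (0.0 < p <= 1.0) for p in beliefs):
        raise ValueError("beliefs must lie in (0, 1]")
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]


def likelihood_gain(logp_with_cot, logp_without_cot):
    """Reward a chain-of-thought by how much it raises the target log-likelihood."""
    return logp_with_cot - logp_without_cot


# Hypothetical belief trajectory: helpful turn, unhelpful turn, decisive turn.
beliefs = [0.05, 0.10, 0.08, 0.40]
credits = belief_shift_credits(beliefs)
print(credits)       # positive, negative, positive
print(sum(credits))  # telescopes to log(0.40 / 0.05)

# Hypothetical log-likelihoods of the same continuation with and without a thought.
print(likelihood_gain(logp_with_cot=-1.2, logp_without_cot=-2.0))
```

Because the log-ratios telescope, the per-turn credits sum to the total belief change over the episode, so each turn's credit is its share of that change.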

But probability-based rewards carry their own hazard, and it's the same one self-consistency triggers from a different angle: confidence is gameable. Binary correctness rewards already push models toward confident guessing because nothing penalizes being sure and wrong, which is why adding a proper scoring rule like the Brier score as a second reward term mathematically ties accuracy and calibration together (see "Does binary reward training hurt model calibration?"). Any signal that rewards high probability invites the model to inflate probability rather than earn it. The lesson across both families is that an unanchored intrinsic signal, whether vote agreement or raw confidence, drifts toward whatever is easiest to satisfy.
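To make the scoring-rule argument concrete, here is a minimal sketch that assumes the combined reward is correctness minus the squared gap between stated confidence and correctness, with equal weighting (the weighting is an assumption, not taken from the notes). It checks numerically that the expected reward is highest when the stated confidence equals the true chance of being right.

```python
def brier_augmented_reward(correct, confidence):
    """Binary correctness reward minus the Brier penalty (confidence - y)^2."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(p_correct, confidence):
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * brier_augmented_reward(True, confidence)
            + (1.0 - p_correct) * brier_augmented_reward(False, confidence))


def best_confidence(p_correct, steps=100):
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))


# A confident wrong answer is now penalized; a binary reward would score both as 0.
print(brier_augmented_reward(False, 1.0))  # sure and wrong
print(brier_augmented_reward(False, 0.0))  # wrong but appropriately unsure

# The reward-maximizing confidence tracks the true success probability.
for p in (0.3, 0.7):
    print(p, best_confidence(p))
```

Under a purely binary reward the stated confidence has no effect on the score, so nothing discourages inflating it. The quadratic penalty removes that free option.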

The deeper backdrop is that none of these signals teaches genuinely new reasoning. RLVR work shows reward learning mostly sharpens sampling toward strategies already in the base model without expanding what it can solve; spurious rewards even work nearly as well as correct ones (see "What does reward learning actually do to model reasoning?" and "Does RLVR actually expand what models can reason about?"). That reframes the whole comparison: if reward is activating latent capability rather than building it, the question becomes which signal activates cleanly without inviting a shortcut. That's why the corpus's most promising thread points past scalar signals entirely, toward natural-language critiques that say *why* an answer failed (see "Can natural language feedback overcome numerical reward plateaus?") and generative judges that reason before scoring (see "Can reward models benefit from reasoning before scoring?"), carrying information a single probability number never can.
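The narrowing claim is typically checked with pass@k, which the source notes cite for the RLVR result. The sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), on hypothetical per-problem success counts chosen only to show how a tuned model can win at k = 1 yet lose at large k. None of the numbers come from the cited work.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts out of n = 100 samples on two problems.
base_counts = [5, 5]    # base model: rarely right, but on both problems
tuned_counts = [90, 0]  # tuned model: sharpened on one, never solves the other

for k in (1, 50):
    print(k, mean_pass_at_k(base_counts, 100, k), mean_pass_at_k(tuned_counts, 100, k))
```

In this toy setup the tuned model leads at k = 1 while the base model leads at k = 50, the crossover pattern read as sharpening rather than expansion.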


Sources (9 notes)

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing claims about intrinsic training signals for reasoning in LLMs. The question remains open: do probability-based rewards outperform self-consistency as training signals, or do they fail in fundamentally different ways?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026.
• Self-consistency bootstraps RL without labels but suffers reward-goal decoupling: models learn to generate confidently wrong yet reproducible answers, causing reward and accuracy to diverge mid-training (2025).
• Probability-based signals (RLSF, ΔBelief-RL, RLP) read graded internal confidence and restore calibration; smaller models matched larger baselines using per-step belief-shift as dense credit (~2025).
• Chain-of-thought pretraining with log-likelihood improvement rewards lifted math/science benchmarks ~19% (2025).
• Both families face the same structural hazard: unanchored intrinsic signals (agreement or raw confidence) drift toward easiest satisfaction; proper scoring rules (Brier score) mathematically pin calibration and accuracy together (2025).
• Reward learning mostly activates latent capability rather than expanding reasoning boundaries; natural-language critiques and generative judges outperform scalar signals by carrying *why* explanations (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-05) — Does Reinforcement Learning Really Incentivize Reasoning Capacity?
• arXiv:2505.14674 (2025-05) — Reward Reasoning Model
• arXiv:2506.03106 (2025-06) — Critique-GRPO: Natural Language and Numerical Feedback
• arXiv:2510.01265 (2025-09) — RLP: Reinforcement as a Pretraining Objective

Your task:
(1) RE-TEST: For each signal family, judge whether newer scaling, tokenizer/architecture changes, or training orchestration (multi-step verifiers, cached reasoning chains) have since RELAXED the reward-hacking and calibration-drift hazards. Separate durable tensions (e.g., any unanchored reward invites shortcut-learning) from perishable constraints (e.g., small models cannot use dense credit). Cite what resolved which.
(2) Surface the strongest work from the last 6 months that RECONCILES or SUPERSEDES the probability vs. self-consistency framing — especially any showing hybrid or composite signals outperform pure scalar rewards.
(3) Propose 2 research questions that assume the regime has moved: e.g., do critique-augmented rewards generalize to out-of-distribution reasoning? Can reasoning models learn to ignore spurious confidence signals in their peers?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines