Can model confidence alone replace external answer verification?
Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
Reinforcement learning with verifiable rewards (RLVR) relies on domain-specific verifiers, which confines it to math and code. Two complementary approaches extend RLVR to general domains by replacing external verification with intrinsic signals.
RLPR (Reinforcement Learning with Reference Probability Reward) uses the LLM's own token probability of generating a reference answer as the reward signal. That probability reflects how well the reasoning process leads to the correct answer: it measures how likely the model is to produce the correct answer after its own reasoning. RLPR has two key innovations: (1) a Probability-based Reward computed from the average decoding probabilities of the reference answer tokens, which is more robust than naive sequence likelihood, and (2) stabilization methods that address the high variance inherent in probability-based rewards. RLPR consistently improves reasoning across Gemma, Llama, and Qwen models on both general-domain and mathematical benchmarks.
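The difference between an averaged token probability and a naive sequence likelihood can be illustrated with a minimal sketch. The function names and the log-probability values below are hypothetical and chosen only for illustration; this is not RLPR's actual implementation, and the stabilization methods are not shown.

```python
import math


def mean_token_probability(token_logprobs):
    """Average per-token probability of the reference answer tokens.

    `token_logprobs` is assumed to hold the model's log-probability for
    each reference answer token, given the prompt and the model's reasoning.
    """
    if not token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    probs = [math.exp(lp) for lp in token_logprobs]
    return sum(probs) / len(probs)


def sequence_likelihood(token_logprobs):
    """Naive alternative: product of per-token probabilities."""
    if not token_logprobs:
        raise ValueError("reference answer must contain at least one token")
    return math.exp(sum(token_logprobs))


# Hypothetical log-probs for a four-token reference answer.
confident = [math.log(0.9)] * 4
# Same answer, but one token is assigned a very low probability.
one_outlier = [math.log(0.9)] * 3 + [math.log(0.001)]

for name, lps in [("confident", confident), ("one_outlier", one_outlier)]:
    print(name, mean_token_probability(lps), sequence_likelihood(lps))
```

In this toy case a single low-probability token drives the sequence likelihood close to zero, while the averaged probability stays well above zero. That is one way to read the robustness claim above.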
INTUITOR goes further: its sole reward signal is the model's own confidence, defined as self-certainty and measured as the average KL divergence between the output distribution and a uniform distribution. It needs no reference answers, no external verifiers, and no labeled data. The approach is simple: replace the verifiable reward in GRPO with self-certainty scores. The mechanism builds on the observation that LLMs exhibit lower confidence on difficult problems, so optimizing for confidence should drive the model toward more reliable reasoning.
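A minimal sketch of this reward swap follows, under stated assumptions. The note does not give the direction of the KL divergence, so the sketch assumes KL(U || p), with the uniform distribution first. It also assumes the advantage is the reward normalized by the mean and population standard deviation within a group of sampled responses. All function names and logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_uniform_to_model(logits):
    """KL(U || p) for one token position (direction is an assumption)."""
    v = len(logits)
    logp = log_softmax(logits)
    # KL(U || p) = sum_j (1/V) * (log(1/V) - log p_j)
    return sum(-math.log(v) - lp for lp in logp) / v


def self_certainty(per_token_logits):
    """Average the per-position KL over all generated tokens."""
    return sum(kl_uniform_to_model(l) for l in per_token_logits) / len(per_token_logits)


def group_normalized_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - mean) / std within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Two hypothetical responses to one prompt, three tokens each, vocabulary of 4.
uncertain = [[0.0, 0.0, 0.0, 0.0]] * 3   # uniform output at every position
confident = [[10.0, 0.0, 0.0, 0.0]] * 3  # sharply peaked at every position

rewards = [self_certainty(uncertain), self_certainty(confident)]
advantages = group_normalized_advantages(rewards)
print(rewards, advantages)
```

A uniform output distribution scores zero self-certainty and a peaked one scores higher, so within a group the more confident response receives the positive advantage. No reference answer enters the computation at any point.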
Both approaches point to the same fundamental issue for future AI: as models develop capabilities beyond human evaluation, self-generated signals may be the only viable training pathway. Together with the related note "Can model confidence work as a reward signal for reasoning?", they offer convergent evidence that intrinsic confidence signals can serve dual roles, improving both performance and reliability.
Building on the related note "Can reasoning improvement work without answer verification?", RLPR and INTUITOR represent the next step: progressively weaker assumptions about what external signal is needed, from reference verification to reference probability to pure self-certainty.
Inquiring lines that use this note as a source 48
- Can LLMs evaluate their own observations without external feedback?
- What verification methods work for knowledge without stable referents?
- Can external verification systems fix what self-verification cannot accomplish?
- What calibration corrections can reduce LLM judge bias in automated evaluation pipelines?
- Should validation responsibility move away from the primary user?
- Why do users systematically overrely on confident LLM outputs across languages?
- How does step-level confidence filtering compare to global confidence averaging?
- Can single models correct their own beliefs without amplifying confidence in wrong answers?
- Do models actually self-assess their confidence or just confirm answers?
- How do we assign confidence and polarity scores to belief edges?
- Can utility control modify LLM values more effectively than output filtering?
- Can external verifiers replace reasoning trace quality in solution guarantees?
- How do calibration and reliability differ in LLM judge evaluations?
- Why does external verification stop error amplification but internal self-assessment enable it?
- Why do diffusion LLM answer tokens converge in confidence long before reasoning stabilizes?
- Can fact-checking systems use LLMs reliably if models abandon correct positions under pressure?
- Does optimizing for model confidence actually improve both performance and calibration simultaneously?
- What happens when confident wrong answers become more rewarded than uncertain correct ones?
- Does exposure to more domain-specific examples reduce LLM overconfidence?
- How does self-revision on wrong answers increase model confidence further?
- Can uncertainty estimates based on model self-assessment reliably signal errors?
- Which use cases can tolerate unverified LLM outputs without external verification?
- What makes accurate confidence different from confident-but-wrong predictions?
- How much does confidence-guided cascading between SAS and MAS improve accuracy?
- Does internalizing verifiers actually close the generation-verification gap?
- Why does prompt sensitivity vanish when model confidence is high?
- What planning tasks benefit most from combining LLM generation with external verification?
- Can intrinsic confidence signals improve both calibration and reasoning performance?
- How does model confidence relate to accuracy in underfitted domains?
- Does majority voting prevent confident but incorrect answers from being reinforced?
- Why does regenerating LLM responses produce different but equally valid answers?
- Can exchange value persist without use value being verified first?
- What role do verifiers play in stabilizing extended reasoning at test time?
- Can confidence levels reliably detect when a model is overthinking?
- Do LLMs need world models to make accurate predictions?
- Does the verification gap widen exactly where judgment replaces checkability?
- Can step-level confidence filtering work better than global confidence scoring?
- What makes out-of-band monitoring better than in-band verification loops?
- What breaks when a mis-synthesized verifier runs with high confidence?
- Why does moving verifier synthesis to the LLM extend verification beyond math and code domains?
- Can verifier output replace ground-truth answers as the asymmetric information source?
- Why does self-verification fail but external process verification work?
- Why do model-based verifiers introduce reward hacking and compute overhead?
- Can LLMs express uncertainty in ways that preserve epistemic honesty?
- Why does self-critique fail without external verification signals?
- Can question-only features replace model uncertainty checks at scale?
- What makes uncertainty calibration harder than expanding knowledge?
- What makes API-based scaffolding more trustworthy than direct model access in high-stakes domains?
Related concepts in this collection 4
- Can model confidence work as a reward signal for reasoning?
  Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages.
  convergent: confidence as reward improves both performance and calibration
- Can reasoning improvement work without answer verification?
  Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.
  RLPR/INTUITOR extend this progression
- Does self-consistency reliably reward correct answers during training?
  Self-consistency initially correlates with correctness, but as models train on this signal, do they eventually learn to maximize consistency itself rather than accuracy? When does this proxy reward stop working?
  risk: confidence-based rewards may select for confident errors
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  intrinsic rewards face the same ceiling
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- Learning to Reason without External Rewards
- Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
- Escaping the Verifier: Learning to Reason via Demonstrations
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Reward Reasoning Model
Original note title
llm intrinsic probability of generating a correct answer can replace external verifiers as reward signal — extending rlvr to general domains