How do we assign confidence and polarity scores to belief edges?
This reads as a question about how to put numbers on a belief graph: how confident an edge is, and whether it points toward support or contradiction. The corpus doesn't tackle belief graphs literally, but it has a lot to say about where 'confidence' and directional belief signals actually come from inside a model.
A belief-graph edge needs two numbers: a confidence weight (how strong) and a polarity (support vs. contradiction). No paper in the corpus builds belief graphs explicitly, but the corpus supplies the two ingredients you'd need: a credible source of confidence and a credible source of direction. The most direct entry point is ΔBelief-RL, which scores an agent's per-turn belief movement as the log-ratio of its own sequential probability estimates (see "Can an agent's own beliefs guide credit assignment without critics?"). That log-ratio can be read as a signed, weighted edge: its sign is the polarity (did this turn move belief toward or away from the target?) and its magnitude is the confidence in that movement. It is computed from the model's own intrinsic probabilities, with no external critic or labeler needed.
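As a minimal sketch of that reading, assuming you already have the model's probability for the target claim before and after a turn, the log-ratio splits into a sign and a magnitude. The function name `belief_edge`, the `eps` clamp, and the choice to return a tuple are illustrative assumptions, not details taken from ΔBelief-RL.

```python
import math

def belief_edge(p_before: float, p_after: float, eps: float = 1e-9):
    """Turn two sequential probability estimates into (polarity, magnitude).

    Assumption: p_before and p_after are the model's own probabilities for
    the same target claim, measured before and after one turn.
    """
    # Clamp away from zero so the logarithm is always defined.
    p_before = min(max(p_before, eps), 1.0)
    p_after = min(max(p_after, eps), 1.0)
    delta = math.log(p_after / p_before)      # signed belief shift
    polarity = (delta > 0) - (delta < 0)      # +1 toward, -1 away, 0 no change
    return polarity, abs(delta)

# Belief doubled: positive polarity, magnitude log(2).
print(belief_edge(0.2, 0.4))
# Belief halved: negative polarity, same magnitude.
print(belief_edge(0.4, 0.2))
```

Keeping sign and magnitude separate makes it easy to store them as two edge attributes, which is the split the question asks for.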
That 'use the model's own probability as the score' move recurs across the collection under different names. RLPR and INTUITOR treat the intrinsic token probability of a correct answer as a reward signal that can replace an external verifier (see "Can model confidence alone replace external answer verification?"), and RLSF ranks reasoning traces by answer-span confidence to build preferences (see "Can model confidence work as a reward signal for reasoning?"). The lesson for edge-weighting: a probability difference is your raw confidence number, but raw probabilities can be badly calibrated unless you treat calibration as a first-class goal. RLSF's headline is that confidence-as-reward *restored* the calibration that ordinary RLHF degrades.
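One way to check whether raw edge confidences are calibrated before trusting them as weights is expected calibration error (ECE). ECE is a standard diagnostic swapped in here for illustration; the notes above do not say which calibration metric RLSF reports. The sketch assumes you have a held-out set of edges with a confidence in [0, 1] and a 0/1 correctness label for each.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece

# Four edges scored at 0.9, three of them actually correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A value near zero means stated confidence tracks observed accuracy; a large value suggests the raw probabilities should be recalibrated before they become edge weights.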
The sharper warning is about granularity. Averaging confidence over a whole reasoning trace smears over exactly the breakdowns you care about; step-level confidence catches local failures that global averaging masks (see "Does step-level confidence outperform global averaging for trace filtering?"). Translated to belief edges: score each edge locally rather than inheriting one global confidence for the whole graph, or a single bad link will hide inside an otherwise confident-looking structure. ReBalance pushes this further by reading confidence *variance and overconfidence* as diagnostic signals, not just point values (see "Can confidence patterns reveal overthinking versus underthinking?"). So an edge weight can be richer than a scalar: spread and overconfidence carry information too.
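The masking effect is easy to see with a toy example. The sketch below assumes a list of per-step (or per-edge) confidences is already available; the helper names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def global_confidence(step_conf):
    """One number for the whole trace or graph: the plain average."""
    return mean(step_conf)

def weakest_window(step_conf, window=1):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if not 1 <= window <= len(step_conf):
        raise ValueError("window must be between 1 and len(step_conf)")
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )

def confidence_spread(step_conf):
    """Population standard deviation, a simple stand-in for 'variance'."""
    return pstdev(step_conf)

# Hypothetical per-edge confidences with one weak link in the middle.
edges = [0.9, 0.9, 0.2, 0.9, 0.9]
print(global_confidence(edges))   # the average still looks healthy
print(weakest_window(edges))      # the local view exposes the weak link
print(confidence_spread(edges))   # a wide spread is itself a warning sign
```

Reporting the weakest window and the spread alongside the mean keeps the bad link visible instead of letting it dissolve into the average.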
Now the caution, because it's the thing you didn't know to ask. High confidence is not the same as being right. Deterministic decoding produces the *same* output every time, but that output is still a single draw from the model's distribution, so repeatability is not reliability (see "Does setting temperature to zero actually make LLM outputs reliable?"). Models can also be confidently wrong precisely on the entity combinations they never saw in pretraining; data statistics flag that risk better than confidence does (see "Can pretraining data statistics detect hallucinations better than model confidence?"). So a confidently scored belief edge is most suspect exactly where the underlying claim is novel. Confidence does predict robustness, though: confident outputs resist prompt rephrasing, while low-confidence ones swing widely (see "Does model confidence predict robustness to prompt changes?"). That gives you a free validity check: score an edge under paraphrased prompts and watch whether the polarity flips.
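The paraphrase check can be sketched as follows. It assumes a hypothetical `score_edge` callable that maps a prompt string to a signed edge score (for example a belief-shift log-ratio); how that callable queries a model is left out, and the stub scores in the usage example are made up for illustration.

```python
from typing import Callable, Sequence

def sign(x: float) -> int:
    return (x > 0) - (x < 0)

def polarity_agreement(
    score_edge: Callable[[str], float],
    prompts: Sequence[str],
) -> float:
    """Share of paraphrased prompts whose polarity matches the majority sign."""
    signs = [sign(score_edge(p)) for p in prompts]
    if not signs:
        raise ValueError("need at least one prompt")
    majority = max(set(signs), key=signs.count)
    return signs.count(majority) / len(signs)

# Stub scorer standing in for a real model call (values are invented).
stub_scores = {
    "Does A support B?": 0.8,
    "Is B backed up by A?": 0.5,
    "Would A make B more likely?": -0.1,
}
agreement = polarity_agreement(stub_scores.__getitem__, list(stub_scores))
print(agreement)  # two of three paraphrases agree on the sign
```

An agreement of 1.0 means the polarity survived every rephrasing; anything lower marks an edge whose sign depends on wording.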
Finally, if these scores come from human or LLM annotation rather than intrinsic probability, the annotations themselves aren't one clean signal. They decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable by consistency across measurement conditions (see "Do all annotation responses measure the same underlying thing?"). The practical takeaway for polarity: a single annotation pass conflates 'I believe this edge' with 'I made up an answer because you asked', and measuring the same edge under varied conditions is how you separate a real signed belief from noise. So the corpus's combined answer: derive magnitude from calibrated, *local* intrinsic probability differences, derive sign from belief-shift log-ratios, and trust neither where the claim is novel or the consistency check fails.
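Put together, the combined answer can be sketched as a small scoring rule. Everything named here is an assumption for illustration: the `EdgeEvidence` fields, the novelty flag (which would come from something like training-data statistics), and the `min_agreement` threshold are not specified anywhere in the corpus.

```python
from dataclasses import dataclass

@dataclass
class EdgeEvidence:
    log_ratio: float           # signed belief shift measured locally for this edge
    claim_is_novel: bool       # hypothetical flag, e.g. from data statistics
    polarity_agreement: float  # share of re-measurements with the same sign

def score_edge(ev: EdgeEvidence, min_agreement: float = 0.8) -> dict:
    """Sign from the log-ratio, magnitude from its size, plus a trust gate."""
    polarity = (ev.log_ratio > 0) - (ev.log_ratio < 0)
    # Trust neither number when the claim is novel or the sign is unstable.
    trusted = (not ev.claim_is_novel) and ev.polarity_agreement >= min_agreement
    return {
        "polarity": polarity,
        "confidence": abs(ev.log_ratio),
        "trusted": trusted,
    }

print(score_edge(EdgeEvidence(log_ratio=0.7, claim_is_novel=False, polarity_agreement=0.9)))
print(score_edge(EdgeEvidence(log_ratio=-1.2, claim_is_novel=True, polarity_agreement=1.0)))
```

Keeping `trusted` as a separate flag, rather than shrinking the confidence toward zero, preserves the raw score for inspection while still marking the edge as unreliable.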
Sources (9 notes)
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, indicating that trace quality matters more than quantity.
ReBalance uses confidence variance and overconfidence as diagnostic signals to apply training-free steering vectors that reduce overthinking redundancy while promoting exploration during underthinking, improving accuracy across models from 0.5B to 32B parameters.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.