How does semantic entropy compare to confidence scores from internal model probabilities?
This explores two different ways of measuring whether an LLM 'knows' something: semantic entropy (sampling several answers and measuring how much their meanings diverge) versus confidence read directly off the model's internal token probabilities — and what each one catches that the other misses.
These are two rival approaches to the same problem, knowing when to trust a model's answer, and they operate at different levels. Confidence scores read uncertainty straight off the model's output distribution: how probable were the tokens it produced? Semantic entropy works over meanings rather than tokens: it samples multiple answers, clusters them by whether they mean the same thing, and measures how scattered those meanings are. The key insight in "Can we detect when language models confabulate?" is that a model can be lexically uncertain (many ways to phrase the same correct fact) while being semantically confident, and vice versa, so token-level probability and meaning-level agreement can disagree. Semantic entropy catches 'confabulations' that look perfectly fluent and high-probability at the token level.
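The sample, cluster, and measure steps can be sketched in a few lines. This is a minimal illustration, not the method as published: it uses a count-based entropy over clusters, and `toy_entails` is a hypothetical stand-in for a real entailment (NLI) model that only compares the final word of each answer.

```python
import math
import string
from typing import Callable, List


def cluster_by_meaning(answers: List[str],
                       entails: Callable[[str, str], bool]) -> List[List[str]]:
    """Group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster shares this meaning, so start a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str],
                     entails: Callable[[str, str], bool]) -> float:
    """Entropy (in nats) over the share of samples in each meaning cluster."""
    clusters = cluster_by_meaning(answers, entails)
    total = len(answers)
    probs = [len(cluster) / total for cluster in clusters]
    return sum(-p * math.log(p) for p in probs)


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical placeholder for an NLI model: compares the last word only."""
    def last_word(text: str) -> str:
        return text.strip(string.punctuation + " ").split()[-1].lower()
    return last_word(a) == last_word(b)


# Different wordings, one meaning: lexically varied, semantically confident.
consistent = ["Paris", "It is Paris.", "The capital of France is Paris"]
# Fluent answers that disagree with each other: a confabulation signature.
scattered = ["Paris", "It is Lyon.", "Probably Marseille"]

print(semantic_entropy(consistent, toy_entails))
print(semantic_entropy(scattered, toy_entails))
```

The first set collapses into one cluster and scores zero despite three different phrasings. The second splits into three clusters and scores ln 3, about 1.10 nats. In practice the cost sits in producing the samples and in the pairwise entailment calls, not in the entropy arithmetic.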
That said, the corpus makes a surprisingly strong case that plain token-probability confidence is more useful than its simplicity suggests. "Can simple uncertainty estimates beat complex adaptive retrieval?" finds that calibrated token-probability uncertainty beats far more elaborate adaptive-retrieval schemes on single-hop tasks and matches them on multi-hop tasks when deciding whether a model needs to look something up, while using a fraction of the LM and retriever calls. Semantic entropy's weakness is exactly its cost: it requires sampling many full answers and running entailment comparisons between them, whereas internal-probability confidence is essentially free, since it comes from the same generation pass that produced the answer. So the comparison is partly accuracy versus cost: semantic entropy sees a failure mode token confidence is blind to, but you pay for every measurement.
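For contrast, a token-probability confidence score needs nothing beyond the log-probabilities that came out of generation. The sketch below assumes those per-token log-probabilities are already in hand and uses their geometric mean as the score. The function names and the 0.5 threshold are illustrative assumptions, not values from the cited note, and the sketch does not include the calibration step that note relies on.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (length-normalised confidence)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float],
                    threshold: float = 0.5) -> bool:
    """Trigger retrieval when confidence falls below a (hypothetical) threshold."""
    return mean_token_confidence(token_logprobs) < threshold


# Illustrative log-probabilities for two short generated answers.
confident_answer = [math.log(0.9), math.log(0.9)]
hesitant_answer = [math.log(0.2), math.log(0.3)]

print(should_retrieve(confident_answer))
print(should_retrieve(hesitant_answer))
```

The whole decision is one pass over numbers the model already emitted, which is where the cost advantage over sampling-based methods comes from.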
The more interesting twist is that both methods can be fooled in the same way, because both are downstream of the model's own self-assessment. "Can pretraining data statistics detect hallucinations better than model confidence?" shows models can be confidently wrong on entity combinations they never saw in training. The symptom (low confidence) never appears, so neither raw probability nor sampled-answer agreement flags it. The fix proposed there, QuCo-RAG, sidesteps the model entirely and looks at entity co-occurrence statistics in the training data, catching the root cause rather than the symptom. That's a third lane: data-side signals that don't depend on the model knowing what it doesn't know.
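A data-side check of this kind can be pictured as a lookup against co-occurrence counts. The sketch below is not the QuCo-RAG implementation. It assumes a precomputed table of how often entity pairs appeared together in training data, and the table contents, function name, and `min_count` cutoff are all hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Tuple


def unseen_pairs(entities: Iterable[str],
                 cooccurrence: Dict[FrozenSet[str], int],
                 min_count: int = 1) -> List[Tuple[str, str]]:
    """Return entity pairs that co-occur fewer than min_count times."""
    flagged = []
    for a, b in combinations(sorted(set(entities)), 2):
        if cooccurrence.get(frozenset((a, b)), 0) < min_count:
            flagged.append((a, b))
    return flagged


# Hypothetical co-occurrence counts; real ones would come from a corpus index.
counts = {frozenset(("Marie Curie", "Nobel Prize")): 1200}

print(unseen_pairs(["Marie Curie", "Nobel Prize"], counts))
print(unseen_pairs(["Marie Curie", "Fields Medal"], counts))
```

A non-empty result would trigger retrieval regardless of how confident the model's own output looks, which is the point of moving the signal outside the model.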
There's also reason to be skeptical of internal confidence as a clean signal. "Why do models produce less uncertain outputs on their own text?" shows post-trained models produce 3–4× lower entropy on their own generated text than on external text, an implicit self-recognition effect that deflates uncertainty for reasons unrelated to correctness. And "Does model confidence predict robustness to prompt changes?" finds that highly confident models resist prompt rephrasing while low-confidence ones swing widely, so a confidence reading is entangled with sensitivity to surface phrasing. Both findings suggest token-probability confidence measures something about the model's relationship to its own output distribution, not purely about truth, which is precisely the gap semantic entropy tries to close by working over meanings instead of tokens.
Worth knowing too: confidence isn't only a detection signal, it can also be a training signal. "Can model confidence work as a reward signal for reasoning?" describes RLSF, which uses answer-span confidence to rank reasoning traces and build preferences without human labels or external verifiers, and in doing so reverses the calibration damage RLHF normally causes. So the contrast isn't just 'which detector is better.' Internal confidence is cheap and reusable as a reward, but distorted by self-recognition and phrasing; semantic entropy is meaning-aware and catches fluent confabulations but is expensive; and both can be blindsided by knowledge gaps that only the training data itself reveals.
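Using confidence as a reward amounts to ranking candidate traces by how sure the model was about the final answer span. The sketch below shows only that ranking idea and is not RLSF as published: the `Trace` type, the geometric-mean score, and the choice of highest versus lowest trace as the preference pair are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trace:
    text: str
    answer_logprobs: List[float]  # log-probs of the answer-span tokens only


def answer_confidence(trace: Trace) -> float:
    """Geometric mean probability over the answer-span tokens."""
    return math.exp(sum(trace.answer_logprobs) / len(trace.answer_logprobs))


def preference_pair(traces: List[Trace]) -> Tuple[Trace, Trace]:
    """Return (chosen, rejected): most and least confident traces."""
    if len(traces) < 2:
        raise ValueError("need at least two traces to form a preference")
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    return ranked[0], ranked[-1]


# Illustrative traces with made-up answer-span log-probabilities.
traces = [
    Trace("trace A", [math.log(0.9), math.log(0.8)]),
    Trace("trace B", [math.log(0.3), math.log(0.4)]),
    Trace("trace C", [math.log(0.6)]),
]

chosen, rejected = preference_pair(traces)
print(chosen.text, rejected.text)
```

No human label enters the loop, so any distortion in the confidence signal itself would flow straight into the preferences.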
Sources (6 notes)
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.