INQUIRING LINE

How does semantic entropy compare to confidence scores from internal model probabilities?

This explores two different ways of measuring whether an LLM 'knows' something: semantic entropy (sampling several answers and measuring how much their meanings diverge) versus confidence read directly off the model's internal token probabilities — and what each one catches that the other misses.


Semantic entropy and token-probability confidence are rival approaches to the same problem, knowing when to trust a model's answer, and they operate at different levels. Confidence scores read uncertainty straight off the model's output distribution: how probable were the tokens it produced? Semantic entropy ignores raw probabilities and instead samples multiple answers, clusters them by whether they mean the same thing (two answers share a cluster when each entails the other), and measures how scattered those meanings are. The key insight in "Can we detect when language models confabulate?" is that a model can be lexically uncertain (many ways to phrase the same correct fact) while being semantically confident, and vice versa, so token-level probability and meaning-level agreement can disagree. Semantic entropy catches 'confabulations' that look perfectly fluent and high-probability at the token level.
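The sample, cluster, and measure steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `toy_same_meaning` is a hypothetical stand-in for a real bidirectional-entailment check (normally an NLI model), and the entropy is computed from cluster frequencies among the samples rather than from model probabilities.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str], same_meaning: Callable[[str, str], bool]
) -> List[List[str]]:
    """Greedy clustering: each answer joins the first cluster whose
    representative (first member) it matches, else starts a new cluster."""
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def semantic_entropy(
    answers: List[str], same_meaning: Callable[[str, str], bool]
) -> float:
    """Shannon entropy (in nats) over the share of samples in each cluster."""
    clusters = cluster_by_meaning(answers, same_meaning)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_same_meaning(a: str, b: str) -> bool:
    # ASSUMPTION: placeholder for bidirectional entailment. It only compares
    # the last word of each non-empty answer, which is enough for the toy
    # inputs below and nothing more.
    def key(s: str) -> str:
        return s.lower().rstrip(".").split()[-1]

    return key(a) == key(b)


# Three phrasings of one fact: lexically varied, semantically unanimous.
agreeing = ["Paris", "It is Paris.", "The capital is Paris"]
# Three different facts: every sample lands in its own cluster.
scattered = ["Paris", "Lyon", "Marseille"]

low = semantic_entropy(agreeing, toy_same_meaning)
high = semantic_entropy(scattered, toy_same_meaning)
```

Here `low` is zero because all three samples fall into one cluster, while `high` equals `math.log(3)`, the maximum for three samples. Swapping `toy_same_meaning` for a real entailment model is where the cost of the method comes from.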

That said, the corpus makes a surprisingly strong case that plain token-probability confidence is more useful than its simplicity suggests. "Can simple uncertainty estimates beat complex adaptive retrieval?" finds that calibrated token-probability uncertainty beats far more elaborate adaptive-retrieval schemes on single-hop tasks and matches them on multi-hop tasks at deciding when a model needs to look something up, using a fraction of the LM and retriever calls. Semantic entropy's weakness is exactly its cost: it requires sampling many full answers and running entailment comparisons between them, whereas internal-probability confidence is essentially free, since it falls out of the same generation pass that produced the answer. So the comparison is partly accuracy versus cost: semantic entropy sees a failure mode token confidence is blind to, but you pay for every measurement.
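To make the cheap side of that trade-off concrete, here is a minimal sketch of a token-probability confidence score used as a retrieval gate. It assumes you already have per-token log-probabilities for the generated answer. The length-normalised score and the `threshold` value are illustrative assumptions, not the calibrated estimator evaluated in the note.

```python
import math
from typing import List


def sequence_confidence(token_logprobs: List[float]) -> float:
    """Length-normalised confidence: exp of the mean token log-probability
    (the geometric mean of the token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: List[float], threshold: float = 0.5) -> bool:
    """Trigger retrieval when confidence falls below the threshold.
    ASSUMPTION: 0.5 is an arbitrary illustrative cut-off; in practice it
    would be tuned on held-out data."""
    return sequence_confidence(token_logprobs) < threshold


# Hypothetical log-probabilities for two short answers.
confident_answer = [math.log(0.9)] * 3
hesitant_answer = [math.log(0.2)] * 3
```

With these inputs, `should_retrieve(confident_answer)` is `False` and `should_retrieve(hesitant_answer)` is `True`. The score needs no extra sampling or entailment calls, which is the cost advantage described above.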

The more interesting twist is that both methods can be fooled in the same way, because both are downstream of the model's own self-assessment. "Can pretraining data statistics detect hallucinations better than model confidence?" shows models can be confidently wrong on entity combinations they never saw in training. The symptom (low confidence) never appears, so neither raw probability nor sampled-answer agreement flags it. The fix proposed there, QuCo-RAG, sidesteps the model's self-assessment and uses entity co-occurrence statistics from the training data to decide when to retrieve, catching the root cause rather than the symptom. That's a third lane: data-side signals that don't depend on the model knowing what it doesn't know.
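The data-side idea can be illustrated with a toy co-occurrence table. This is a sketch of the general principle only, not QuCo-RAG's actual pipeline: the entity sets, the `min_count` cut-off, and both function names are hypothetical, and real systems would need entity extraction and an index over the pretraining corpus.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Set, Tuple


def build_cooccurrence(docs: Iterable[Set[str]]) -> Counter:
    """Count how many documents mention each unordered pair of entities."""
    counts: Counter = Counter()
    for entities in docs:
        for pair in combinations(sorted(entities), 2):
            counts[pair] += 1
    return counts


def unseen_pairs(
    query_entities: Set[str], counts: Counter, min_count: int = 1
) -> List[Tuple[str, str]]:
    """Entity pairs in the query seen together fewer than min_count times.
    A non-empty result is the signal to retrieve, whatever the model's
    own confidence says."""
    return [
        pair
        for pair in combinations(sorted(query_entities), 2)
        if counts[pair] < min_count
    ]


# Hypothetical corpus, already reduced to the entities each document mentions.
corpus = [{"Marie Curie", "Warsaw"}, {"Marie Curie", "Sorbonne"}]
counts = build_cooccurrence(corpus)

seen = unseen_pairs({"Marie Curie", "Warsaw"}, counts)
risky = unseen_pairs({"Marie Curie", "Oslo"}, counts)
```

`seen` is empty because the pair appears in the corpus, while `risky` contains the pair that never co-occurred. The check never consults the model, so a confidently wrong answer cannot suppress it.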

There's also reason to be skeptical of internal confidence as a clean signal. "Why do models produce less uncertain outputs on their own text?" shows post-trained models produce 3–4× lower output entropy on their own generated text than on external text, an implicit self-recognition effect that deflates uncertainty for reasons unrelated to correctness. And "Does model confidence predict robustness to prompt changes?" finds that highly confident models resist prompt rephrasing while low-confidence outputs swing widely, meaning an uncertain answer can change with surface phrasing alone. Both findings suggest token-probability confidence measures something about the model's relationship to its own output distribution, not purely about truth, which is precisely the gap semantic entropy tries to close by working over meanings instead of tokens.
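The quantity being compared in the self-recognition finding is the entropy of the next-token distribution, averaged over positions. A minimal sketch of that measurement follows, assuming you can obtain a probability distribution per token position. The function names are illustrative, and no particular model API is implied.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one next-token distribution.
    Zero-probability entries contribute nothing and are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions: List[Sequence[float]]) -> float:
    """Average per-position entropy over a text."""
    if not distributions:
        raise ValueError("need at least one distribution")
    return sum(token_entropy(d) for d in distributions) / len(distributions)
```

Computing `mean_entropy` once over the model's distributions on its own generations and once over external text, then taking the ratio, gives the kind of gap the note reports. A low value here says the model finds the text unsurprising, which is not the same as the text being correct.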

Worth knowing too: confidence isn't only a detection signal, it can be a training signal. "Can model confidence work as a reward signal for reasoning?" describes RLSF, which uses answer-span confidence to rank reasoning traces and build preferences without human labels or external verifiers, and in doing so reverses the calibration damage RLHF normally causes. So the contrast isn't just 'which detector is better.' Internal confidence is cheap and reusable as a reward, but distorted by self-recognition and phrasing; semantic entropy is meaning-aware and catches fluent confabulations but is expensive; and both can be blindsided by knowledge gaps that only the training data itself reveals.
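Ranking traces by answer-span confidence to build preference data could look roughly like this. It is a sketch under stated assumptions rather than RLSF's actual procedure: the `Trace` type, the `margin` filter, and the way `answer_confidence` is obtained are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass
class Trace:
    text: str
    # ASSUMPTION: a scalar confidence for the final answer span, e.g. derived
    # from the token probabilities of the answer tokens only.
    answer_confidence: float


def preference_pairs(
    traces: List[Trace], margin: float = 0.1
) -> List[Tuple[Trace, Trace]]:
    """Return (chosen, rejected) pairs where the chosen trace is more
    confident by at least `margin`. The margin is an illustrative filter
    that drops near-ties."""
    ranked = sorted(traces, key=lambda t: t.answer_confidence, reverse=True)
    return [
        (better, worse)
        for better, worse in combinations(ranked, 2)
        if better.answer_confidence - worse.answer_confidence >= margin
    ]


# Hypothetical traces for one question.
traces = [Trace("b", 0.5), Trace("a", 0.9), Trace("c", 0.45)]
pairs = preference_pairs(traces)
pair_names = [(chosen.text, rejected.text) for chosen, rejected in pairs]
```

`pair_names` comes out as `[("a", "b"), ("a", "c")]`; the `b` versus `c` pair is dropped because the two confidences are too close to call. The resulting pairs could feed any standard preference-optimisation step without human labels.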


Sources (6 notes)

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Why do models produce less uncertain outputs on their own text?

Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **How do semantic entropy and token-probability confidence compare as uncertainty signals—and has the frontier moved since early 2025?** This is still open.

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable.
- Semantic entropy catches fluent confabulations invisible to token-probability confidence by clustering meanings rather than tokens; post-trained models show 3–4× lower output entropy on their own generations, a self-recognition effect that inflates apparent confidence (~2026).
- Calibrated token-probability confidence beats elaborate adaptive-retrieval schemes on single-hop tasks and matches them on multi-hop when deciding whether to retrieve, at a fraction of the compute; but token confidence and semantic entropy both fail on knowledge gaps absent from training data (~2025).
- Confidence tracks prompt robustness, not purely correctness; it measures the model's relationship to its own output distribution (~2025).
- Token confidence can be recycled as intrinsic reward for preference learning, reversing RLHF calibration damage; semantic entropy is expensive (requires sampling + entailment) (~2024–2025).
- Models are confidently wrong on unseen entity combinations; neither signal flags this without data-side co-occurrence priors (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2401.06855 (Fine-grained Hallucination Detection, 2024-01)
- arXiv:2501.12835 (Adaptive Retrieval Without Self-Knowledge, 2025-01)
- arXiv:2605.25459 (Self-Recognition & Self-Reaction in LLMs, 2026-05)
- arXiv:2605.01428 (Hallucinations & Metacognition, 2026-05)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For the 3–4× entropy gap, self-recognition bias, and semantic-vs-token disagreement: have newer model families, post-training methods (RL from self-feedback, online learning, enaction-based training), or evaluations since mid-2025 narrowed or overturned these? Separate the durable question (when and why do confidence types diverge?) from perishable claims (specific magnitude gaps, cost trade-offs). Cite what narrowed each gap.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last 6 months. Does arXiv:2605.25459 (self-enaction) or arXiv:2506.09038 (AbstentionBench) or recent metacognition work reframe the trade-off entirely?
(3) **Propose 2 research questions** that assume the regime has shifted: e.g., "If post-trained models now self-calibrate entropy via enaction, can semantic entropy be learned rather than sampled?" or "Does training on high-entropy minority tokens (~2506.09522) make confidence scores more meaning-aware?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
