INQUIRING LINE

How does linguistic calibration differ from token probability calibration?

This explores the gap between two senses of 'calibration': whether a model's raw token probabilities track correctness, versus whether its uncertainty is expressed at the level of meaning and dialogue — and the corpus shows these can diverge sharply.


This question is really about *where* you measure a model's confidence. Token probability calibration asks a narrow, mechanical thing: when the model assigns a token (or answer span) a probability of 0.8, is it right about 80% of the time? Linguistic calibration asks something broader — does the uncertainty the model conveys in *meaning* and in *conversation* match what it actually knows? The corpus is interesting precisely because it shows these two layers come apart.
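
The narrow, mechanical sense of calibration can be made concrete with a small sketch. The function below computes expected calibration error (ECE), a common binned summary of the gap between stated confidence and observed accuracy. The function name, the equal-width binning, and the toy inputs are illustrative assumptions, not something taken from the notes cited here.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    Assumption: confidences lie in [0, 1] and are grouped into
    equal-width bins; other binning schemes exist.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: five answers at 0.8 confidence, four of them right (well calibrated).
well_calibrated = expected_calibration_error([0.8] * 5, [True] * 4 + [False])
# Toy data: four answers at 0.9 confidence, only one right (overconfident).
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
```

A score near zero means the stated probabilities track correctness. Note that nothing in this computation looks at what the answers mean, which is the limitation the rest of this piece turns on.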

The cleanest demonstration is semantic entropy (see "Can we detect when language models confabulate?"). A model can be perfectly fluent and token-confident while confabulating, because token probabilities score wording rather than meaning: the same false claim can be phrased many ways, and each individual phrasing looks probable. Only when you sample multiple answers and cluster them by *meaning* (does answer A entail answer B, and B entail A?) does the real uncertainty surface. That's the heart of the distinction: token-level calibration is blind to confabulations that meaning-level calibration catches. The unit of measurement changes the answer.
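
The clustering step can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `entails(a, b)` callable; in practice that role would be played by a natural-language-inference model, and the `toy_entails` stand-in below only matches city names. The entropy here is computed over cluster frequencies among the samples, which is one simple variant and not necessarily the exact estimator used in the cited work.

```python
import math
from typing import Callable, List, Sequence

Entails = Callable[[str, str], bool]


def cluster_by_meaning(answers: Sequence[str], entails: Entails) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose
    representative it entails and is entailed by (bidirectional entailment)."""
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def semantic_entropy(answers: Sequence[str], entails: Entails) -> float:
    """Shannon entropy (in nats) over the share of samples in each meaning cluster."""
    clusters = cluster_by_meaning(answers, entails)
    n = len(answers)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_entails(a: str, b: str) -> bool:
    """Hypothetical stand-in for an NLI model: two answers are treated as
    equivalent when they mention the same city."""
    cities = ("paris", "lyon")

    def key(s: str) -> str:
        return next((c for c in cities if c in s.lower()), s.lower())

    return key(a) == key(b)


# Three phrasings, one meaning: low semantic entropy.
consistent = semantic_entropy(
    ["Paris", "It is Paris.", "The capital is Paris"], toy_entails
)
# Fluent answers that disagree in meaning: high semantic entropy.
conflicting = semantic_entropy(
    ["Paris", "Lyon", "It's Paris", "Lyon, I think"], toy_entails
)
```

In the first case the wording varies but the meaning does not, so the entropy is zero; in the second, each answer may read as confident while the samples split evenly between two claims.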

That said, token-probability calibration is not a poor cousin; it's often shockingly useful. For deciding *when* a model should go look something up, calibrated token uncertainty beats elaborate multi-call adaptive-retrieval heuristics on single-hop tasks and matches them on multi-hop ones, using a fraction of the LM and retriever calls (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). Answer-span confidence can even be recycled as a training reward: RLSF uses it to rank reasoning traces, which sharpens reasoning and *restores* the calibration that RLHF tends to erode (see "Can model confidence work as a reward signal for reasoning?"). So token probabilities carry real signal. The catch is that fine-tuning for human preference degrades it, which is exactly why people reach for meaning-level measures as a check.
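
As a rough illustration of uncertainty-gated retrieval, the sketch below turns the token log-probabilities of a draft answer into a single confidence score and retrieves only when that score is low. The scoring rule (geometric-mean token probability), the `0.7` threshold, and both function names are assumptions made for this example; the cited work may estimate uncertainty differently.

```python
import math
from typing import Sequence


def answer_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a draft answer, i.e. the
    length-normalised sequence likelihood."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.7) -> bool:
    """Retrieve only when the model's own confidence falls below the threshold.

    Assumption: the threshold is tuned on held-out data; 0.7 is a placeholder.
    """
    return answer_confidence(token_logprobs) < threshold


# A confident draft: every token near probability 0.95, so no lookup is needed.
confident_draft = [math.log(0.95)] * 3
# A shaky draft: one token at probability 0.2 drags the score down.
shaky_draft = [math.log(0.9), math.log(0.2)]

decisions = (should_retrieve(confident_draft), should_retrieve(shaky_draft))
```

The appeal of this design is cost: the decision reuses probabilities the model already produced, so it adds no extra LM or retriever calls until a lookup is actually triggered.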

There's a third layer the question quietly points at: calibration as a *conversational act*. Humans calibrate by building common ground, asking clarifying questions and repairing misunderstandings mid-dialogue. LLMs largely skip this, operating in 'static grounding' mode where they answer immediately rather than negotiating what was meant (see "Why do language models skip the calibration step?"). This is linguistic calibration in its richest sense: not just *expressing* uncertainty accurately, but *acting* on it by slowing down. Speech and dialogue systems learned this lesson long ago. With speech recognition error rates of 15–30% in noisy environments, POMDP-based systems had to maintain belief distributions over user intent rather than commit to one reading (see "Why do dialogue systems need probabilistic reasoning?").
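
Maintaining a belief distribution over intent, and asking instead of answering when that belief is too flat, can be sketched as a Bayesian update plus a commit threshold. This is a simplified illustration and not a full POMDP; the intent names, likelihood values, and the `0.8` threshold are hypothetical.

```python
from typing import Dict


def update_belief(belief: Dict[str, float], likelihood: Dict[str, float]) -> Dict[str, float]:
    """Bayes rule: posterior(intent) is proportional to
    prior(intent) * P(observation | intent)."""
    unnormalised = {
        intent: prior * likelihood.get(intent, 0.0) for intent, prior in belief.items()
    }
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observation is impossible under every intent")
    return {intent: p / total for intent, p in unnormalised.items()}


def next_action(belief: Dict[str, float], commit_threshold: float = 0.8) -> str:
    """Act on the most likely intent only when it is probable enough;
    otherwise ask a clarifying question (a dynamic-grounding move)."""
    intent, p = max(belief.items(), key=lambda kv: kv[1])
    return f"act:{intent}" if p >= commit_threshold else "ask_clarifying_question"


prior = {"book_flight": 0.5, "book_hotel": 0.5}

# A noisy utterance that only weakly favours one intent.
noisy = update_belief(prior, {"book_flight": 0.6, "book_hotel": 0.4})
# A clear utterance that strongly favours one intent.
clear = update_belief(prior, {"book_flight": 0.9, "book_hotel": 0.1})

actions = (next_action(noisy), next_action(clear))
```

The noisy observation leaves the belief at 0.6 versus 0.4, so the system asks; the clear one pushes it to 0.9, so it commits. Static grounding corresponds to always taking the top intent regardless of how flat the belief is.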

Why do the two layers diverge at all? Because the model's surface output is a sample, not a commitment. A model holds a superposition of plausible continuations and samples one at generation time; regenerate and you get a different, equally confident-sounding answer (see "Do large language models actually commit to a single character?"). Token probability calibrates the *sampling distribution*; linguistic calibration tries to calibrate the *thing being claimed*. The reader's takeaway: a model that sounds well-calibrated word-by-word can still be badly calibrated about what it means, and the only way to see that is to stop reading tokens and start comparing meanings.


Sources (6 notes)

Can we detect when language models confabulate?

Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

Why do dialogue systems need probabilistic reasoning?

Real-world speech recognition achieves 15-30 percent error rates in noisy environments, making deterministic flowchart dialogue systems unworkable. POMDP-based systems handle this by maintaining belief distributions over user intent rather than committing to single interpretations.

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a calibration researcher re-testing whether token probability and linguistic calibration remain distinct constraints in current LLMs. The question: do modern models still exhibit the divergence between confident-looking tokens and meaning-level uncertainty, or have recent training methods, prompting, or architecture changes collapsed the gap?

What a curated library found — and when (dated claims, not current truth): Studies spanning 2019–2026 identify calibration at two measurable layers:
• Semantic entropy reveals confabulations invisible to token-level confidence: a model can assign high probability to each token in a false claim because the same lie paraphrases smoothly (~2024–2025).
• Token-probability calibration outperforms heuristic adaptive-retrieval at deciding when to retrieve, at lower cost, even while the model hallucinates downstream (~2025).
• RLHF degrades token-probability calibration; meaning-level measures (entailment clustering, semantic entropy) recover the signal that token probabilities no longer carry (~2024–2025).
• Models operate in 'static grounding' mode, answering immediately rather than building common ground or maintaining belief distributions over intent, unlike dialogue systems operating at 15–30% ASR error (~2025–2026).
• Regeneration tests show models sample from superposition; the same input yields different high-confidence outputs, proving tokens calibrate distribution, not commitment (~2024).

Anchor papers (verify; mind their dates):
• 2401.06855 (Fine-grained Hallucination Detection, Jan 2024)
• 2501.12835 (Adaptive Retrieval & Uncertainty, Jan 2025)
• 2506.08952 (Grounding & Political Questions, Jun 2025)
• 2507.21931 (RL from Self-Feedback, Jul 2025)

Your task:
(1) RE-TEST: For each layer (token confidence, semantic entropy, adaptive retrieval, static vs. dynamic grounding), determine whether post-Jul-2025 models, multi-turn prompting, agentic loops with reflection, or new RL schemes (esp. self-feedback, intrinsic-reward fine-tuning) have narrowed or erased the gap. Flag which constraints still hold and which may have shifted.
(2) Surface strongest work from last ~6 months contradicting or superseding the token/linguistic split — e.g., unified calibration schemes, dialogue-aware training, or evidence that meaning-level uncertainty now drives token selection.
(3) Propose two research questions assuming the regime may have moved: (a) Can self-feedback RL recover token calibration *and* prevent hallucination simultaneously? (b) Do multi-agent dialogue systems trained on LLMs (with negotiation loops) naturally learn dynamic grounding?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
