How should designers measure and explain semantic uncertainty to users?
This explores how systems can detect that an AI is uncertain about *meaning* (not just word-choice), and how that uncertainty should be surfaced to people who tend to follow confidence rather than accuracy.
The problem has two halves: measuring semantic uncertainty (confidence about meaning, not just surface wording) and displaying it to users who, the corpus shows, are easily misled by how sure an answer *sounds*. The corpus has surprisingly sharp material on both halves, and they pull against each other in an interesting way.
On measurement, the standout move is to compute uncertainty over *meanings* rather than tokens. Semantic entropy samples several answers, clusters them by whether they entail each other in both directions, and measures how much the meanings diverge, catching 'confabulations' that look perfectly fluent token-by-token (Can we detect when language models confabulate?). Cheaper signals also do real work: calibrated token-probability estimates beat elaborate adaptive-retrieval schemes on single-hop tasks, and match them on multi-hop tasks, at deciding when a model doesn't know enough and should look something up (Can simple uncertainty estimates beat complex adaptive retrieval?), and small models trained to *abstain* when unsure can match models ten times their size on conversation forecasting (Can models learn to abstain when uncertain about predictions?). The engineering insight for designers: the capacity to know-that-it-doesn't-know exists in models but is undertrained, so you often have to elicit it deliberately rather than read it off the surface.
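The sample-cluster-measure loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the published method's implementation: it uses the simple count-based variant of entropy over clusters, and `entails` is a hypothetical callback that a real system would back with a natural-language-inference model. The `toy_entails` stand-in (which only compares the last word of each answer) exists purely so the example runs without a model.

```python
import math
from typing import Callable, List

# Hypothetical signature: entails(premise, hypothesis) -> bool.
Entails = Callable[[str, str], bool]


def cluster_by_meaning(answers: List[str], entails: Entails) -> List[List[str]]:
    """Group answers that entail each other in both directions."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            # Bidirectional entailment is treated as "same meaning".
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            # No existing cluster shares this meaning, so start a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(answers: List[str], entails: Entails) -> float:
    """Entropy (in nats) over the share of samples in each meaning cluster."""
    clusters = cluster_by_meaning(answers, entails)
    total = len(answers)
    shares = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in shares)


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Assumption for illustration only: two answers 'entail' each other
    when their final word matches. A real system would call an NLI model."""
    def last_word(text: str) -> str:
        return text.lower().rstrip(".!?").split()[-1]
    return last_word(premise) == last_word(hypothesis)


agreeing = ["The capital is Paris.", "Paris", "It is Paris"]
diverging = ["The capital is Paris.", "Paris", "I think Lyon", "Lyon"]

low = semantic_entropy(agreeing, toy_entails)    # one meaning cluster
high = semantic_entropy(diverging, toy_entails)  # two equally sized clusters
```

Differently worded answers that share a meaning collapse into one cluster and contribute no entropy, while fluent answers that disagree in meaning raise it. That is the property that lets the measure flag confabulations that token-level scores miss.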
There's also a deeper, structural source of semantic uncertainty worth surfacing: genuine ambiguity in the *question*. Models are strikingly bad at noticing when text supports multiple readings. GPT-4 correctly handles deliberately ambiguous cases only about a third of the time, versus ninety percent for humans (Can language models recognize when text is deliberately ambiguous?). That reframes the design job: sometimes the honest signal isn't 'I'm 70% confident' but 'your prompt has two meanings and I picked one.' Architecturally, some work tries to give models room to *hold* that uncertainty rather than collapse it: stochastic latent reasoning lets a model represent a distribution over solutions instead of committing to one (Can stochastic latent reasoning help models explore multiple solutions?).
Now the harder half: explaining it. The corpus delivers a blunt warning: users in every language tested *follow the confidence signal, not the accuracy*, so an overconfident wrong answer is systematically believed (Do users worldwide trust confident AI outputs even when wrong?). Worse, the underlying answers aren't even stable. The same model resists rephrasing when confident but swings wildly when not (Does model confidence predict robustness to prompt changes?), and emotional framing alone shifts what information comes back (Does emotional tone in prompts change what information LLMs provide?). There's even a social-pressure failure mode: models avoid contradicting a confident-but-wrong user to 'save face' (Why do language models avoid correcting false user claims?). The design lesson is uncomfortable: surfacing a confidence number is not neutral information, it's a lever that overrides scrutiny.
Put together, the corpus points designers somewhere unexpected: the most useful thing to *measure* (semantic entropy, abstention thresholds, ambiguity detection) and the most dangerous thing to *display* (a confidence score) are not the same object. The signal you trust internally — meaning-divergence, willingness to abstain, recognition that the question was ambiguous — should probably be translated into behavior (asking a clarifying question, declining, retrieving) rather than a number, precisely because users over-read numbers. The interesting open territory the corpus leaves you with: explaining *which kind* of uncertainty is in play — 'I don't know this fact' versus 'your question has two answers' — may matter more than how *much* uncertainty there is.
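One way to picture "translate the signal into behavior rather than a number" is a small routing function. The sketch below is only an illustration of that idea under stated assumptions. The `Action` names, the `choose_action` function, and the `entropy_threshold` default are all hypothetical, and the ordering of the checks is a design guess rather than something the corpus prescribes.

```python
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    ASK_CLARIFYING_QUESTION = "ask_clarifying_question"
    RETRIEVE = "retrieve"
    ABSTAIN = "abstain"


def choose_action(
    semantic_entropy: float,
    question_is_ambiguous: bool,
    can_retrieve: bool,
    entropy_threshold: float = 0.5,  # hypothetical cutoff; would need tuning
) -> Action:
    """Map internal uncertainty signals to a user-visible behavior."""
    # "Your question has two answers": resolve the question first.
    if question_is_ambiguous:
        return Action.ASK_CLARIFYING_QUESTION
    # Sampled meanings agree closely enough to answer directly.
    if semantic_entropy <= entropy_threshold:
        return Action.ANSWER
    # "I don't know this fact": look it up if possible, otherwise decline.
    return Action.RETRIEVE if can_retrieve else Action.ABSTAIN


confident = choose_action(0.1, question_is_ambiguous=False, can_retrieve=True)
ambiguous = choose_action(0.1, question_is_ambiguous=True, can_retrieve=True)
unsure_with_search = choose_action(0.9, question_is_ambiguous=False, can_retrieve=True)
unsure_without_search = choose_action(0.9, question_is_ambiguous=False, can_retrieve=False)
```

The user never sees the entropy value here. They see a clarifying question, a retrieved answer, or a refusal, which keeps the *kind* of uncertainty visible without handing over a number to over-read.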
Sources (9 notes)
Clustering sampled answers by bidirectional entailment and computing entropy over semantic clusters catches confabulations invisible at token level. This self-referential approach works across tasks without task-specific training data.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.
GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
Cross-linguistic research shows users in every language trust confident AI outputs even when inaccurate. While confidence expression varies by language, users everywhere track confidence signals rather than accuracy, making overconfident errors systematically followed.
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
LLMs fail to reject false presuppositions even when they demonstrate correct knowledge on direct questions. Models exhibit face-saving behavior—avoiding explicit correction to maintain social harmony—mirroring human conversational norms learned from training data.