Does model confidence actually explain why paraphrases produce different outputs?
This explores whether 'model confidence' is the real reason reworded-but-equivalent prompts yield different answers, or just a symptom of something underneath it.
This reads the question as a quiet challenge. ProSA-style work says paraphrase instability is a confidence story: when a model is sure, it shrugs off rephrasing; when it's unsure, outputs swing (see: Does model confidence predict robustness to prompt changes?). That's a clean, satisfying explanation. The corpus suggests it's also incomplete, because it describes *when* the swings happen without saying *what* the model is actually responding to.
The sharper answer comes from the frequency camp. Two threads, paraphrase 'equivalence' as a fiction and the systematic win of high-frequency phrasings, argue that models register statistical mass from pretraining rather than meaning as their primary mechanism (see: Why do semantically identical prompts produce different LLM outputs?; Do language models really understand meaning or just surface frequency?). A rare wording and a common wording can mean the identical thing, yet the common one wins across math, translation, commonsense, and tool-calling. If that's the mechanism, then 'low confidence' on a paraphrase isn't an independent cause; it's what low corpus frequency *feels like* from the outside. Confidence and frequency may be two readings of the same dial.
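To make 'statistical mass' concrete, here is a minimal sketch that scores two equivalent phrasings by how often their bigrams appear in a corpus. Everything in it is an assumption for illustration: the four-line `toy_corpus` stands in for pretraining text, the example phrasings are invented, and `mean_bigram_count` is a hypothetical stand-in for whatever sentence-level frequency measure the cited work actually uses.

```python
from collections import Counter


def bigrams(text):
    """Split text on whitespace and return its adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def mean_bigram_count(phrase, corpus_counts):
    """Average corpus count of the phrase's bigrams (unseen bigrams count as 0)."""
    grams = bigrams(phrase)
    if not grams:
        return 0.0
    return sum(corpus_counts[g] for g in grams) / len(grams)


# Hypothetical toy corpus standing in for pretraining text.
toy_corpus = [
    "what is the sum of two and three",
    "what is the sum of four and five",
    "what is the capital of france",
    "compute the total of two and three",
]
corpus_counts = Counter(g for line in toy_corpus for g in bigrams(line))

# Two phrasings of the same request: one common, one rare.
common = "what is the sum of two and three"
rare = "compute the aggregate of two and three"

common_score = mean_bigram_count(common, corpus_counts)
rare_score = mean_bigram_count(rare, corpus_counts)

print(f"common phrasing: {common_score:.2f}")
print(f"rare phrasing:   {rare_score:.2f}")
```

The two requests mean the same thing, but the common wording carries more corpus mass under this crude measure. That gap, not any difference in meaning, is what the frequency camp says the model is responding to.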
There's a deeper hint that confidence is internally manufactured rather than meaning-tracking. Post-trained models produce 3–4x lower output entropy on their own generated text, driven by an internal sense of input surprise that modulates the output distribution without ever being verbalized (see: Why do models produce less uncertain outputs on their own text?). So a model can be 'confident' simply because a phrasing looks familiar, the same pull that makes it over-trust answers it generated itself (see: Why do models trust their own generated answers?). Confidence here is a recognition signal, not a correctness signal, which is exactly why it can be high on a frequent-but-wrong reading and low on a rare-but-right one.
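A small sketch of why entropy behaves as a recognition signal rather than a correctness signal. The two next-token distributions below are invented numbers, not measurements from any model, and `shannon_entropy` is just the textbook formula applied to them.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in nats of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical next-token distributions over the same four candidate tokens.
familiar = [0.90, 0.05, 0.03, 0.02]    # peaked: input looks familiar
unfamiliar = [0.40, 0.30, 0.20, 0.10]  # flat: input looks surprising

h_familiar = shannon_entropy(familiar)
h_unfamiliar = shannon_entropy(unfamiliar)
ratio = h_unfamiliar / h_familiar

# Entropy depends only on the shape of the distribution, not on which
# token holds the mass: moving the peak to a different (possibly wrong)
# token leaves the entropy unchanged.
peak_elsewhere = list(reversed(familiar))
same_entropy = math.isclose(h_familiar, shannon_entropy(peak_elsewhere))

print(f"familiar:   {h_familiar:.3f} nats")
print(f"unfamiliar: {h_unfamiliar:.3f} nats")
print(f"peak moved, entropy unchanged: {same_entropy}")
```

Because the measure never looks at which token is right, a peaked distribution over a wrong answer reads as exactly as 'confident' as a peaked distribution over a correct one.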
That said, confidence isn't a useless construct; it's just better treated as a usable estimate than as the explanation. Calibrated token-probability uncertainty can beat elaborate adaptive-retrieval heuristics on single-hop tasks and match them on multi-hop ones, with a fraction of the model and retriever calls, when deciding whether a model should look something up (see: Can simple uncertainty estimates beat complex adaptive retrieval?). Answer-span confidence can even serve as a reward signal that restores calibration while strengthening reasoning (see: Can model confidence work as a reward signal for reasoning?). Confidence is a real, measurable, exploitable quantity. It just sits downstream of frequency and familiarity rather than upstream of them.
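As a rough illustration of confidence as a usable estimate, the sketch below gates retrieval on the mean token log-probability of a draft answer. It is not the method from the cited note: the function names, the `-1.0` threshold, and the log-probability values are all hypothetical, and the calibration step the note depends on is not shown.

```python
def mean_token_logprob(logprobs):
    """Average per-token log-probability of a generated answer."""
    return sum(logprobs) / len(logprobs)


def should_retrieve(logprobs, threshold=-1.0):
    """Return True when the draft answer looks uncertain enough to retrieve.

    The threshold is an assumed value for illustration; in practice it
    would be tuned on held-out data after calibrating the probabilities.
    """
    if not logprobs:
        return True  # nothing generated, so fall back to retrieval
    return mean_token_logprob(logprobs) < threshold


# Hypothetical per-token log-probabilities for two draft answers.
confident_draft = [-0.05, -0.10, -0.02]
unsure_draft = [-1.8, -2.3, -0.9]

print(should_retrieve(confident_draft))
print(should_retrieve(unsure_draft))
```

The appeal of a gate like this is cost: it reuses probabilities the model already produced for its draft, rather than spending extra model or retriever calls to decide whether retrieval is needed.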
The thing worth walking away with: 'paraphrase sensitivity' looks like a confidence problem but behaves like a grounding problem. Two related failures share the same shape: context getting overridden by strong training priors (see: Why do language models ignore information in their context?) and models being unable to hold two valid interpretations of an ambiguous sentence at once (see: Can language models recognize when text is deliberately ambiguous?). In both, the model isn't weighing meaning; it's leaning on whichever surface form pretraining made heaviest. Confidence is the dashboard light. Frequency is the engine.
Sources (9 notes)
ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pretraining, not meaning.
LLMs show a consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests that tracking statistical mass from pretraining, rather than recognizing meaning, is their primary mechanism.
Post-trained models produce 3-4x lower output entropy on their own generations, driven by an internal representation of input surprise that causally modulates confidence. This implicit self-recognition signal appears without being verbalized, encoded directly in the output distribution.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.