SYNTHESIS NOTE


Why do semantically identical prompts produce different LLM outputs?

Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.

Synthesis note · 2026-05-02 · sourced from Natural Language Inference

Cao et al. (2024) showed that prompts with the same meaning can yield outputs of very different quality. Adam's Law isolates frequency as a primary variable in that variance: when paraphrase pairs are matched on meaning but differ in sentence-level corpus frequency, the higher-frequency variant systematically wins. This converts a known phenomenon, prompt sensitivity, from a vague reliability concern into a specific claim about what the model is actually responding to.
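The matched-pair comparison can be made concrete with a small sketch. This is not the procedure from either paper. It assumes each paraphrase pair already carries an estimated corpus frequency and an output-quality score per variant, and the `ParaphrasePair` type and its field names are hypothetical. It reports how often the higher-frequency variant scores better.

```python
from dataclasses import dataclass


@dataclass
class ParaphrasePair:
    """Two prompts with the same meaning (hypothetical record layout)."""
    freq_a: float   # estimated corpus frequency of variant A (assumed given)
    freq_b: float   # estimated corpus frequency of variant B
    score_a: float  # output-quality score obtained with variant A
    score_b: float  # output-quality score obtained with variant B


def higher_frequency_win_rate(pairs):
    """Fraction of decided pairs in which the higher-frequency variant scored higher.

    Pairs tied on frequency or on score are skipped, since they carry no signal
    about the direction of the effect.
    """
    wins = 0
    decided = 0
    for p in pairs:
        if p.freq_a == p.freq_b or p.score_a == p.score_b:
            continue
        decided += 1
        high_is_a = p.freq_a > p.freq_b
        a_scored_higher = p.score_a > p.score_b
        if high_is_a == a_scored_higher:
            wins += 1
    return wins / decided if decided else float("nan")
```

A win rate well above 0.5 on meaning-matched pairs would be consistent with the frequency account. A rate near 0.5 would not.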

The implication for "Does model confidence predict robustness to prompt changes?" is direct but complicating. Confidence-based accounts read prompt sensitivity as model uncertainty fluctuating across surface variations. Adam's Law inserts a deeper variable: even at fixed model confidence, frequency mass differs across paraphrases because pre-training exposure differs, and that exposure asymmetry shapes the prediction independently of how confident the model "feels." Confidence and frequency are entangled, but frequency is the more upstream cause.
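One way to probe the "fixed confidence" claim is to stratify paired comparisons by model confidence and check whether the frequency effect survives inside each stratum. The sketch below assumes each record is a `(confidence, high_freq_won)` tuple with confidence in [0, 1]. The record layout and the equal-width binning are illustrative choices, not something the sources specify.

```python
from collections import defaultdict


def win_rate_by_confidence_bin(records, n_bins=4):
    """Win rate of the higher-frequency variant within each confidence bin.

    records: iterable of (confidence, high_freq_won), where confidence is a
    float in [0, 1] and high_freq_won is a bool.
    Returns {bin_index: win_rate} for the non-empty bins only.
    """
    totals = defaultdict(lambda: [0, 0])  # bin index -> [wins, count]
    for confidence, high_freq_won in records:
        # Equal-width bins; a confidence of exactly 1.0 falls in the last bin.
        b = min(int(confidence * n_bins), n_bins - 1)
        totals[b][0] += int(high_freq_won)
        totals[b][1] += 1
    return {b: wins / count for b, (wins, count) in sorted(totals.items())}
```

If the higher-frequency variant keeps winning within narrow confidence bins, confidence alone does not account for the variance. If the effect vanishes once confidence is held roughly constant, the confidence-based account regains ground.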

For a Language-as-Event frame, this is load-bearing. A prompt is not a transparent vessel that hands meaning to the model. It is a token sequence whose statistical mass relative to pre-training shapes how the model parses the request before any semantic interpretation occurs. Two synonymous sentences are not the same event. They are two different statistical encounters that happen to share a meaning a human would assign them. The model registers the encounter; meaning is what we read into the registration. This connects to "Can models pass tests while missing the actual grammar?": when surface and meaning compete, surface wins by construction.

A practical corollary: prompt engineering as a discipline is partly a folk practice of frequency optimization. "Phrase it like a textbook" or "rewrite the prompt the way StackOverflow would phrase it" are intuitive moves toward higher-frequency surface forms. Adam's Law gives that folk practice a name and a mechanism, and also a warning: frequency-tuning a prompt does not improve the model's reasoning; it only moves the request into a denser region of the model's training distribution.
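The folk practice can be sketched as an explicit selection step: among candidate paraphrases, keep the one that looks most frequent. The scorer below is a toy proxy, the mean add-one-smoothed log unigram probability under word counts from a small reference text. It is an assumption made for illustration, not the sentence-level frequency estimator behind Adam's Law, and both function names are hypothetical.

```python
import math
from collections import Counter


def sentence_log_frequency(sentence, word_counts, total):
    """Mean smoothed log unigram probability of a sentence (toy frequency proxy)."""
    tokens = sentence.lower().split()
    if not tokens:
        return float("-inf")
    vocab = len(word_counts)
    # Add-one smoothing so unseen words get a small, nonzero probability.
    log_probs = [math.log((word_counts[t] + 1) / (total + vocab)) for t in tokens]
    return sum(log_probs) / len(log_probs)


def pick_most_frequent(candidates, reference_text):
    """Return the candidate paraphrase with the highest proxy frequency."""
    word_counts = Counter(reference_text.lower().split())
    total = sum(word_counts.values())
    return max(
        candidates,
        key=lambda s: sentence_log_frequency(s, word_counts, total),
    )


reference = "how do i sort a list in python how do i reverse a list"
candidates = [
    "how do i sort a list",
    "elucidate the procedure for ordering a sequence",
]
best = pick_most_frequent(candidates, reference)
```

The selection changes only the surface form. Both candidates ask for the same thing, so any gain in output quality reflects where the request lands in the training distribution, not better reasoning.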

Inquiring lines that read this note 33

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do prompt structure and constraints affect model instruction reliability?

Can prompting inject entirely new knowledge into language models?

Can prompting strategies overcome LLM biases without model fine-tuning?

What prevents language models from reliably adopting diverse personas?

Why do LLM regenerations produce meaningfully different personalities from the same prompt?

Why do language models reinforce false assumptions instead of correcting them?

When does optimizing for quality undermine the value of diversity?

How does tokenization toward corpus mean affect downstream output diversity?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do LLMs generate logical forms without preserving semantic content?

Do language models understand semantics or rely on pattern matching?

How much semantic meaning survives when LLMs paraphrase poetry and literary text?

What factors beyond surface content determine how readers extract meaning differently?

Why do different readers extract different meanings from identical text?

What critical LLM failures do standard benchmarks hide?

Why do different LLMs converge on nearly identical outputs?

Can model confidence signals reliably improve reasoning quality and calibration?

Does model confidence actually explain why paraphrases produce different outputs?

Why do language models struggle with implicit discourse relations?

Can the same predicate generate different projection strength in different contexts?

How can identical external performance mask different internal representations?

What happens when prompt-optimized results lack anchoring in real data?

Why do reasoning models fail at systematic problem-solving and search?

Why does output alignment fail to catch internally incoherent reasoning?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do adversarial and manipulative prompts attack reasoning models?

Why do paraphrasing defenses fail against subliminal prompt attacks?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

What happens when we treat LLM outputs as sampled rather than stored?

What makes weaker teacher models effective for stronger student training?

What alignment procedures cause different models to share the same output distribution?

What drives capability and cost efficiency in agent systems?

Can two agents with identical token counts produce vastly different outputs?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can similar outputs from different systems prove they work the same way?

Related concepts in this collection 2

Does model confidence predict robustness to prompt changes? Explores whether a model's certainty about its answer determines how much it resists prompt rephrasing and semantic variation. This matters because it could explain why some tasks are harder to evaluate reliably.
confidence framing complicated by frequency as deeper variable
Can models pass tests while missing the actual grammar? Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
surface dominates when surface and meaning compete
