Why do semantically identical prompts produce different LLM outputs?
Explores why paraphrases with the same meaning yield different model outputs. This matters because it reveals what LLMs actually respond to during inference—and whether prompt engineering is optimizing meaning or something else.
Cao et al. (2024) showed that prompts with the same meaning can yield very different output quality. Adam's Law isolates frequency as a primary variable in that variance: when paraphrase pairs are matched on meaning but differ in sentence-level corpus frequency, the higher-frequency variant systematically wins. This converts a known phenomenon, prompt sensitivity, from a vague reliability concern into a specific architectural claim about what the model is actually responding to.
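To make "sentence-level corpus frequency" concrete, here is a minimal Python sketch. It assumes a simple proxy, the mean smoothed log frequency of a sentence's tokens, which is not necessarily the measure used in the Adam's Law paper. The function name `sentence_log_frequency` and the toy corpus are hypothetical; a real comparison would need counts from a corpus that approximates the model's pre-training data.

```python
import math
from collections import Counter


def sentence_log_frequency(sentence, counts, total):
    """Mean add-one-smoothed log relative frequency of the sentence's tokens.

    This is an illustrative proxy for sentence-level frequency, not the
    paper's definition. Tokenization is plain whitespace splitting.
    """
    tokens = sentence.lower().split()
    if not tokens:
        raise ValueError("sentence has no tokens")
    vocab_size = len(counts)
    # The extra +1 in the denominator reserves probability mass for unseen tokens.
    denom = total + vocab_size + 1
    log_probs = [math.log((counts[tok] + 1) / denom) for tok in tokens]
    return sum(log_probs) / len(log_probs)


# Toy stand-in for a pre-training corpus (hypothetical, for illustration only).
corpus_tokens = "please explain how this works . please explain the idea .".split()
counts = Counter(corpus_tokens)
total = sum(counts.values())

# Two paraphrases a human would treat as equivalent in meaning.
common = "please explain how this works"
rare = "kindly elucidate how this operates"

common_score = sentence_log_frequency(common, counts, total)
rare_score = sentence_log_frequency(rare, counts, total)

# Under this proxy, the variant built from frequently seen tokens scores higher.
print(common_score > rare_score)
```

The proxy ignores word order and phrase-level frequency, so it only separates paraphrases that differ in vocabulary. An n-gram or language-model-based score would be needed to capture differences in phrasing.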
The implication for the related question "Does model confidence predict robustness to prompt changes?" is direct but complicating. Confidence-based accounts read prompt sensitivity as model uncertainty fluctuating across surface variations. Adam's Law inserts a deeper variable: even at fixed model confidence, frequency mass differs across paraphrases because pre-training exposure differs, and that exposure asymmetry shapes the prediction independent of how confident the model "feels." Confidence and frequency are entangled, but frequency is the more upstream cause.
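One way to probe the "even at fixed model confidence" claim is to compare paraphrase pairs within confidence buckets. The sketch below is an illustration under assumptions, not a method taken from the cited work: `PairResult`, the bucket labels, and the scores are hypothetical placeholders rather than measurements.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PairResult:
    confidence_bucket: str  # e.g. "low" or "high"; the bucketing scheme is an assumption
    high_freq_score: float  # task score for the higher-frequency paraphrase
    low_freq_score: float   # task score for the lower-frequency paraphrase


def win_rate_by_confidence(results):
    """Per bucket, the fraction of non-tied pairs won by the higher-frequency paraphrase."""
    wins = defaultdict(int)
    decided = defaultdict(int)
    for r in results:
        if r.high_freq_score == r.low_freq_score:
            continue  # ties say nothing about direction, so skip them
        decided[r.confidence_bucket] += 1
        if r.high_freq_score > r.low_freq_score:
            wins[r.confidence_bucket] += 1
    return {bucket: wins[bucket] / n for bucket, n in decided.items()}


# Placeholder inputs that only demonstrate the call; these are not experimental data.
example = [
    PairResult("high", 1.0, 0.0),
    PairResult("high", 0.0, 1.0),
    PairResult("low", 1.0, 0.0),
    PairResult("low", 0.5, 0.5),  # tie, excluded from the rate
]
print(win_rate_by_confidence(example))
```

If frequency shapes predictions independently of confidence, one would expect the win rate to stay above 0.5 inside each bucket rather than only in the low-confidence one. A real analysis would also need enough pairs per bucket to support a significance test.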
For a Language-as-Event frame, this is load-bearing. A prompt is not a transparent vessel that hands meaning to the model. It is a token sequence whose statistical mass relative to pre-training shapes how the model parses the request before any semantic interpretation occurs. Two synonymous sentences are not the same event. They are two different statistical encounters that happen to share a meaning a human would assign them. The model registers the encounter; meaning is what we read into the registration. This connects to the related question "Can models pass tests while missing the actual grammar?": when surface and meaning compete, surface wins by construction.
A practical corollary: prompt engineering as a discipline is partly a folk practice of frequency optimization. "Phrase it like a textbook" or "rewrite the prompt the way StackOverflow would phrase it" are intuitive moves toward higher-frequency surface forms. Adam's Law gives that folk practice a name and a mechanism. It also gives a warning: frequency-tuning a prompt does not improve the model's reasoning; it only moves the request into a denser region of the model's training distribution.
Inquiring lines that use this note as a source (33)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does token generation as flow differ from print's archival storage?
- What makes prompt engineering different from the research thinking it replaces?
- What prompt types best extract different aspects of item content?
- How does prompt framing subtly determine what kind of opposing argument an LLM generates?
- What makes the prompt a fundamentally new kind of speech act?
- Why do LLM regenerations produce meaningfully different personalities from the same prompt?
- How much does prompt format shape what reasoning strategy a model uses?
- How do ordering effects compound across different prompt component scales?
- Why does ad-hoc prompt engineering violate scientific method standards?
- Is paraphrase invariance a reliable assumption when deploying language models in production?
- How does tokenization toward corpus mean affect downstream output diversity?
- Why do users rephrase prompts toward median register over specialized phrasing?
- Why do LLMs generate logical forms without preserving semantic content?
- How does prompt design alter what kind of creativity LLMs can express?
- Why do true and false LLM outputs use the same mechanism?
- How much semantic meaning survives when LLMs paraphrase poetry and literary text?
- Why do different readers extract different meanings from identical text?
- Why do different LLMs converge on nearly identical outputs?
- Does model confidence actually explain why paraphrases produce different outputs?
- Can the same predicate generate different projection strength in different contexts?
- What happens when prompt-optimized results lack anchoring in real data?
- Why does output alignment fail to catch internally incoherent reasoning?
- Does LLM reasoning always match the outputs it generates?
- Why do paraphrasing defenses fail against subliminal prompt attacks?
- Why does regenerating LLM responses produce different but equally valid answers?
- What happens when we treat LLM outputs as sampled rather than stored?
- How does decomposed prompting formalize prompt libraries as reusable software modules?
- What alignment procedures cause different models to share the same output distribution?
- Can two agents with identical token counts produce vastly different outputs?
- Can similar outputs from different systems prove they work the same way?
- How do logical forms of prompts influence what language models can derive?
- Why do prompt effects reverse between different model generations?
- What other pragmatic prompt features have unstable effects?
Related concepts in this collection (2)
- Does model confidence predict robustness to prompt changes?
  Explores whether a model's certainty about its answer determines how much it resists prompt rephrasing and semantic variation. This matters because it could explain why some tasks are harder to evaluate reliably.
  confidence framing complicated by frequency as deeper variable
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  surface dominates when surface and meaning compete
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Adam's Law: Textual Frequency Law on Large Language Models
- Mind Your Tone: Investigating How Prompt Politeness Affects LLM Accuracy (short paper)
- Measuring Faithfulness in Chain-of-Thought Reasoning
- ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
- Large Language Models Are Human-level Prompt Engineers
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge
- Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
Original note title
paraphrase equivalence is a fiction — same-meaning prompts produce different LLM outputs because frequency, not semantics, drives the prediction