Why does regenerating LLM responses produce different but equally valid answers?
This explores why an LLM can give you two different answers to the same prompt — and why both can be 'right' — which turns out to be a feature of how these models sample from a distribution rather than commit to one true answer.
The short version: the model never picked one answer in the first place. It holds a distribution over plausible continuations and *samples* from it at generation time. Shanahan's 20-questions test makes this vivid: regenerate the same response and you get different outputs, each consistent with the prior conversation. That indicates the model maintains a superposition of possible 'characters' or objects and never commits to any single one (see: Do large language models actually commit to a single character?). So variation isn't a glitch on top of a 'real' answer; it's the model showing you its uncertainty.
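As a rough illustration of sampling from a distribution rather than committing to one answer, here is a minimal sketch. The candidate objects, their scores, and the function names are all hypothetical; a real model samples token by token over a vocabulary, not over a short list of finished answers.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_answer(candidates, logits, temperature=1.0, rng=random):
    """Draw one candidate according to its softmax probability."""
    probs = softmax(logits, temperature)
    return rng.choices(candidates, weights=probs, k=1)[0]

# Hypothetical objects, all consistent with the answers given so far in the game.
candidates = ["cat", "dog", "rabbit"]
logits = [2.0, 1.8, 1.0]  # made-up scores for illustration

rng = random.Random(0)
# Each "regeneration" is a fresh draw from the same distribution.
draws = [sample_answer(candidates, logits, rng=rng) for _ in range(10)]
print(draws)
```

Every draw is a valid continuation of the same conversation; nothing in the code stores a single "true" object that the draws deviate from.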
That reframes 'equally valid' in a useful way. Sometimes the alternatives genuinely are equivalent, and sometimes they only look equivalent. There are tasks where multiple answers are legitimately correct: reconstructing the logical structure of an argument, for instance, is underdetermined by the text itself, so different valid reconstructions exist with no ground truth to break the tie (see: Why do different people reconstruct the same argument differently?). But often the variation is the model wobbling, not the world being ambiguous. When you run the same persona prompt repeatedly, the spread of outputs across runs can match or exceed the spread across *different* personas, which means you are seeing model uncertainty, not stable knowledge being expressed two ways (see: Why do LLM persona prompts produce inconsistent outputs across runs?).
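The persona comparison can be made concrete by putting run-to-run spread next to persona-to-persona spread. The sketch below assumes each run has been reduced to a numeric score; the persona names and ratings are invented purely to exercise the function and are not results from any study.

```python
from statistics import mean, pvariance

def within_and_between(runs_by_persona):
    """Return (mean within-persona variance, variance of the persona means)."""
    within = mean(pvariance(scores) for scores in runs_by_persona.values())
    between = pvariance([mean(scores) for scores in runs_by_persona.values()])
    return within, between

# Made-up ratings on a 1-5 scale, five repeated runs per hypothetical persona.
toy_runs = {
    "persona_a": [2, 4, 3, 5, 1],
    "persona_b": [3, 3, 4, 2, 5],
    "persona_c": [4, 2, 3, 5, 2],
}

within, between = within_and_between(toy_runs)
# When within >= between, repeated runs disagree with each other at least
# as much as the personas disagree with one another.
print(within >= between)
```

If the within-persona figure dominates, differences between personas cannot be read as stable differences in what the model "knows" about each one.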
The variation also carries a diagnostic signal. Shanahan's framework uses the *pattern* of regeneration to tell kinds of falsehood apart: fabrication produces high variation (the model makes it up fresh each time), a good-faith error stays stable across regenerations, and role-played deception stays stable but shifts with context (see: Can we distinguish types of LLM falsehood by regeneration patterns?). So 'regenerate and compare' is a cheap test for whether an answer is anchored to something or just confabulated.
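The regenerate-and-compare test might be sketched as follows. This assumes answers can be compared as normalized strings and that a second batch of regenerations under an altered context is available; the threshold and the label wording are arbitrary illustrative choices, not values from Shanahan's work.

```python
from collections import Counter

def normalize(answers):
    """Lowercase and trim answers so trivial formatting differences do not count."""
    return [a.strip().lower() for a in answers]

def variation(answers):
    """Fraction of answers that differ from the most common one (0.0 = fully stable)."""
    if not answers:
        raise ValueError("need at least one answer")
    cleaned = normalize(answers)
    top_count = Counter(cleaned).most_common(1)[0][1]
    return 1 - top_count / len(cleaned)

def regeneration_pattern(same_context, shifted_context, threshold=0.3):
    """Rough label from regenerations under the original and an altered context.

    The threshold is an arbitrary cutoff chosen for illustration only.
    """
    if variation(same_context) > threshold:
        return "high variation: consistent with fabrication"
    same_top = Counter(normalize(same_context)).most_common(1)[0][0]
    shifted_top = Counter(normalize(shifted_context)).most_common(1)[0][0]
    if same_top != shifted_top:
        return "stable but context-dependent: consistent with role-played deception"
    return "stable across contexts: anchored (correct, or a good-faith error)"

# Hypothetical regenerations of the same factual question.
print(regeneration_pattern(["1987", "1992", "2001", "1975"], ["1987"]))
```

Stability alone does not show that an answer is true: a good-faith error is stable too, so the label only says whether the answer is anchored to something.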
Here's the twist most people miss: killing the randomness doesn't fix the problem. Set temperature to zero and fix the seed and you'll get the same answer on every run, but that's *fixed randomness*, not reliability. The frozen output is still just one draw from the distribution, and testing across many repetitions shows that consistency and correctness are different properties (see: Does setting temperature to zero actually make LLM outputs reliable?). The prompt side has hidden variance too: two phrasings that mean the same thing can produce systematically different answers, because the model responds to how *frequently* a phrasing appeared in training rather than to its meaning (see: Why do semantically identical prompts produce different LLM outputs?).
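A toy simulation can show why a fixed seed buys consistency without buying correctness. The `generate` function below is a hypothetical stand-in for a seeded model call, and the 60/40 split between right and wrong answers is an invented number, not a measurement.

```python
import random

def generate(seed, candidates, weights):
    """Toy stand-in for one seeded generation: same seed, same draw."""
    rng = random.Random(seed)
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical answer distribution: the model is right 60% of the time.
candidates = ["right", "wrong"]
weights = [0.6, 0.4]

# Re-running with one fixed seed is perfectly consistent.
repeats = [generate(42, candidates, weights) for _ in range(5)]
print(len(set(repeats)))  # 1

# Consistency says nothing about which answer got frozen: across many
# seeds, a sizeable share of the frozen outputs are the wrong answer.
across_seeds = [generate(seed, candidates, weights) for seed in range(1000)]
share_right = across_seeds.count("right") / len(across_seeds)
print(round(share_right, 2))
```

Any single seed returns the same answer forever, yet whether that answer is the right one is still governed by the underlying distribution.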
The practical upshot: regeneration variance is a free measurement tool. Instead of fighting it, you can sample many times and use the spread itself as a confidence signal. A closely related move underlies methods that use the model's own probability of generating a correct answer as a reward, removing the need for external verifiers (see: Can model confidence alone replace external answer verification?). The wobble isn't noise to suppress; it's the model telling you how sure it is.
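One simple way to turn the spread of repeated samples into a number is sketched below. It scores confidence as one minus the normalized entropy of the sampled answers, which is an illustrative choice and not the reward used by RLPR or INTUITOR (those work from the model's token probabilities). The function name and the sample answers are hypothetical.

```python
import math
from collections import Counter

def spread_confidence(answers):
    """Return (majority answer, confidence in [0, 1]) from repeated samples.

    Confidence is 1 minus the normalized Shannon entropy of the answer counts:
    1.0 when every sample agrees, 0.0 when every sample is distinct.
    """
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    n = len(answers)
    majority = counts.most_common(1)[0][0]
    if len(counts) == 1:
        return majority, 1.0  # all samples agree (also covers n == 1)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return majority, 1 - entropy / math.log(n)

# Hypothetical answers from regenerating the same prompt four times.
answer, confidence = spread_confidence(["42", "42", "42", "17"])
print(answer, round(confidence, 2))
```

A low score is a cue to treat the majority answer with suspicion or to look for outside verification, while a high score only says the model is consistent, not that it is correct.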
Sources (7 notes)
Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.
Multiple valid argument reconstructions exist for the same text with no ground truth. This is not annotation error but an inherent feature of the task—different formalization schemas are each internally valid.
When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.
Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.