INQUIRING LINE

Regenerate an AI's response and you get a different but equally valid answer — because it never committed to one in the first place.

Why does regenerating LLM responses produce different but equally valid answers?

This explores why an LLM can give you two different answers to the same prompt — and why both can be 'right' — which turns out to be a feature of how these models sample from a distribution rather than commit to one true answer.

The short version: the model never picked one answer in the first place. It holds a distribution of plausible continuations and *samples* from it at generation time. Shanahan's 20-questions test makes this vivid: regenerate the same response and you get different outputs, each consistent with the prior conversation. That is strong evidence that the model maintains a superposition of possible 'characters' or objects and never commits to any single one (see "Do large language models actually commit to a single character?"). So variation isn't a glitch on top of a 'real' answer; it's the model showing you its uncertainty.
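A toy sketch may make the sampling picture concrete. It is not how any real model is implemented: the `CANDIDATES` table and its weights are invented for illustration, standing in for "objects still consistent with the game so far". Each call to `regenerate` draws afresh and nothing is stored between calls.

```python
import random
from collections import Counter

# Hypothetical distribution over objects still consistent with the
# conversation so far. The names and weights are made up for illustration.
CANDIDATES = {"cat": 0.40, "dog": 0.35, "fox": 0.25}


def regenerate(rng: random.Random) -> str:
    """Draw one answer from the distribution; no answer is remembered."""
    names = list(CANDIDATES)
    weights = list(CANDIDATES.values())
    return rng.choices(names, weights=weights, k=1)[0]


def regenerate_many(n: int, seed: int) -> Counter:
    """Regenerate n times and count how often each answer appears."""
    rng = random.Random(seed)
    return Counter(regenerate(rng) for _ in range(n))


if __name__ == "__main__":
    # Several different answers show up, all drawn from the same distribution.
    print(regenerate_many(1000, seed=0))
```

Every answer that appears is consistent with the (hypothetical) context; there is no hidden "real" answer that the other draws deviate from.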

That reframes 'equally valid' in a useful way. Sometimes the alternatives genuinely are equivalent, and sometimes they only look it. There are tasks where multiple answers are legitimately correct: reconstructing the logical structure of an argument, for instance, is underdetermined by the text itself, so different valid reconstructions exist with no ground truth to break the tie (see "Why do different people reconstruct the same argument differently?"). But often the variation is the model wobbling, not the world being ambiguous. When you run the same persona prompt repeatedly, the spread of outputs across runs can match or exceed the spread across *different* personas, which means you are seeing model uncertainty rather than stable knowledge expressed two ways (see "Why do LLM persona prompts produce inconsistent outputs across runs?").

The variation also carries a diagnostic signal worth knowing about. Shanahan's framework uses the *pattern* of regeneration to tell apart kinds of falsehood: fabrication produces high variation (the model is making it up fresh each time), a good-faith error stays stable across regenerations, and role-played deception stays stable but shifts with context (see "Can we distinguish types of LLM falsehood by regeneration patterns?"). So 'regenerate and compare' is a cheap test for whether an answer is anchored to something or just confabulated, with the caveat that a stable answer can still be a stable error.
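A minimal sketch of the 'regenerate and compare' check, assuming you have already collected several regenerated answers as strings. The function names and the 0.8 threshold are hypothetical choices for illustration, not part of Shanahan's framework, and exact string matching is a crude stand-in for judging whether two answers agree.

```python
from collections import Counter


def agreement_rate(answers: list[str]) -> float:
    """Share of regenerations matching the most common answer (1.0 = stable)."""
    if not answers:
        raise ValueError("need at least one answer")
    # Crude normalization; real answers would need semantic comparison.
    normalized = [a.strip().lower() for a in answers]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def looks_anchored(answers: list[str], threshold: float = 0.8) -> bool:
    """Hypothetical rule of thumb: high agreement suggests an anchored answer.

    A stable answer may still be a stable (good-faith) error, so this only
    separates 'varies a lot' from 'does not vary', not right from wrong.
    """
    return agreement_rate(answers) >= threshold


if __name__ == "__main__":
    print(agreement_rate(["Paris", "paris ", " PARIS"]))
    print(looks_anchored(["1947", "1952", "1949", "1947"]))
```

Low agreement points toward fabrication; high agreement only tells you the answer is stable, so it needs a separate correctness check.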

Here's the twist most people miss: killing the randomness doesn't fix the problem. Set temperature to zero and fix the seed and you'll get the same answer on every run, but that's *fixed randomness*, not reliability. The frozen output is still just one draw from the distribution, and testing across many repetitions shows that consistency and correctness are different things (see "Does setting temperature to zero actually make LLM outputs reliable?"). Even the prompt side has hidden variance: two phrasings that mean exactly the same thing produce systematically different answers, because the model responds to how *frequently* a phrasing appeared in training, not to meaning (see "Why do semantically identical prompts produce different LLM outputs?").
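The point about fixed randomness can be sketched with a temperature-scaled softmax over a toy set of logits. The logits below are invented, and real inference stacks differ in detail; the sketch only shows that lowering the temperature concentrates probability on the top option, and that a fixed seed reproduces one particular draw, without either saying anything about whether that option is correct.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities; lower temperature sharpens the peak."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_pick for the zero limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits: list[float]) -> int:
    """The temperature-zero limit: always return the highest-logit index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def seeded_pick(logits: list[float], temperature: float, seed: int) -> int:
    """Sample one index with a fixed seed; same seed gives the same draw."""
    rng = random.Random(seed)
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.9, 0.1]  # hypothetical: option 0 is only narrowly preferred
    print(softmax_with_temperature(logits, 1.0))
    print(greedy_pick(logits), seeded_pick(logits, 1.0, seed=42))
```

With these made-up logits, greedy decoding returns option 0 on every run even though the temperature-1 distribution gives it less than half the probability mass. The repeatability comes from the decoding rule, not from the model being sure.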

Regeneration variance also works as a free measurement tool. Instead of fighting it, you can sample many times and use the spread itself as a confidence signal. A related move underlies methods such as RLPR and INTUITOR, which use the model's own token probabilities and confidence as a reward signal in place of external verifiers (see "Can model confidence alone replace external answer verification?"). The wobble isn't noise to suppress; it's the model telling you how sure it is.

Sources (7 notes)

Do large language models actually commit to a single character?

Shanahan's 20-questions test shows LLMs maintain a superposition of consistent objects or characters and sample from that distribution at generation time. Regenerating the same response yields different outputs, each consistent with prior context, proving no fixed commitment exists.

Why do different people reconstruct the same argument differently?

Multiple valid argument reconstructions exist for the same text with no ground truth. This is not annotation error but an inherent feature of the task—different formalization schemas are each internally valid.

Why do LLM persona prompts produce inconsistent outputs across runs?

When the same persona prompt is run repeatedly, output variance across runs matches or exceeds variance across different personas. This reveals that model uncertainty, not stable social knowledge, drives persona-simulated outputs, making them unsuitable for simulating human annotation disagreement.

Can we distinguish types of LLM falsehood by regeneration patterns?

Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Why do semantically identical prompts produce different LLM outputs?

Cao et al. and Adam's Law show that semantically identical prompts with different sentence-level frequencies produce systematically different output quality. Higher-frequency phrasings win because models register statistical mass from pre-training, not meaning.

Can model confidence alone replace external answer verification?

RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Can You Trust LLM Judgments? Reliability of LLM-as-a-Judge (2.45 match, arXiv)
ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs (1.65 match, arXiv)
When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models (1.62 match, arXiv)
RLPR: Extrapolating RLVR to General Domains without Verifiers (0.87 match, arXiv)
Learning to Reason without External Rewards (0.87 match, arXiv)
Post-Training Large Language Models via Reinforcement Learning from Self-Feedback (0.86 match, arXiv)
Escaping the Verifier: Learning to Reason via Demonstrations (0.86 match, arXiv)
Adam's Law: Textual Frequency Law on Large Language Models (0.86 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing claims about LLM regeneration variance and sampling behavior. The question remains open: *Why do regenerated LLM responses differ, and what do those differences tell us?* A curated library (2021–2026) found:

**What a curated library found — and when (dated claims, not current truth):**
- LLMs sample from a distribution of plausible continuations; they never commit to a single answer in advance (Shanahan-style regeneration tests, ~2023–2024).
- Variation can signal model uncertainty vs. genuine task ambiguity; sampling patterns diagnose fabrication (high variance), good-faith error (stable), and role-play deception (stable but context-sensitive) (~2024–2025).
- Temperature=0 + fixed seed yields *fixed randomness*, not reliability; consistency and correctness are orthogonal (~2024).
- Prompt paraphrase equivalence is fiction: frequency-of-phrasing in training, not semantic meaning, drives response divergence (Adam's Law, ~2026).
- LLM intrinsic probability of correctness can replace external verifiers; regeneration spread itself is a free confidence signal (~2024–2025).
- Newer work on persona stability (multi-turn RL, ~2025) and grounding under loaded questions (~2026) suggests regeneration variance tracks *when* the model lacks real commitment to ground truth.

**Anchor papers (verify; mind their dates):**
- arXiv:2304.14399 (2023): Ambiguity modeling
- arXiv:2410.12405 (2024): Prompt sensitivity
- arXiv:2511.00222 (2025): Persona stability via RL
- arXiv:2604.02176 (2026): Frequency law

**Your task:**
(1) **RE-TEST EACH CONSTRAINT.** For every finding, judge whether post-2026 scaling, instruction-tuning, RLHF variants, or new eval harnesses have *relaxed* the claim. Specifically: does temperature scaling or latent-space steering now produce stable *and* correct outputs? Can modern prompting (chain-of-thought, self-reflection) reduce the frequency-effect? Does the intrinsic-probability-as-reward pipeline now outperform external verifiers, or does it still wobble?
(2) **Surface strongest CONTRADICTING work** from last 6 months. Are there papers showing regeneration variance *decreases* under certain training regimes, or that model commitment is *not* illusory?
(3) **Propose 2 open research questions** that assume the regime may have shifted: e.g., "Can we design prompts or fine-tuning that make the sampling distribution sharper *without* sacrificing correctness?" or "Does regeneration variance track epistemic vs. aleatoric uncertainty separately?" Cite arXiv IDs; flag anything you cannot ground in a real paper.
