INQUIRING LINE

Can distributional views explain when an LLM appears to change its mind?

This explores whether thinking of an LLM as a probability distribution over outputs — rather than a single agent with fixed views — accounts for the moments where it looks like it reverses position or 'changes its mind.'


The distributional view pictures an LLM not as one mind but as a probability distribution it samples from each turn. Can that picture explain why a model appears to switch positions mid-conversation? The corpus suggests it explains a lot, but not everything, and the gaps are where the interesting story lives.

The strongest case for 'yes' is the superposition picture: an LLM doesn't commit to one character but holds many consistent ones at once, and each reply is a draw from that spread, which narrows as context accumulates (see 'Does an LLM commit to a single character or maintain many?'). On this read, an apparent change of mind isn't a mind changing at all: it's the distribution collapsing toward a different region as the conversation steers it, or simply a different sample surfacing. The same lens reframes 'reliability': pinning temperature to zero just replays one draw from that distribution over and over, which looks stable but is still a single sample, not a settled belief (see 'Does setting temperature to zero actually make LLM outputs reliable?'). So consistency and conviction are not the same thing, and a flip between sessions can be distributional noise rather than genuine reconsideration.
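The gap between a stable output and a settled belief can be illustrated with a toy sampler. This is a minimal sketch, not how any particular model is implemented: the stance labels, the logit values, and the helper names (`softmax`, `sample_stance`) are all hypothetical, and treating temperature 0 as a greedy argmax is an assumed convention.

```python
import math
import random
from collections import Counter


def softmax(logits, temperature):
    """Temperature-scaled softmax; temperature must be > 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_stance(logits, stances, temperature, rng):
    """Draw one stance. Temperature 0 is treated as greedy argmax (assumption)."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return stances[best]
    probs = softmax(logits, temperature)
    return rng.choices(stances, weights=probs, k=1)[0]


# Hypothetical stances and made-up logits: the top stance is only
# slightly ahead, so the underlying distribution is far from settled.
STANCES = ["agree", "disagree", "hedge"]
LOGITS = [1.2, 1.0, 0.3]

rng = random.Random(0)
greedy = [sample_stance(LOGITS, STANCES, 0, rng) for _ in range(100)]
sampled = [sample_stance(LOGITS, STANCES, 1.0, rng) for _ in range(100)]
probs = softmax(LOGITS, 1.0)

print("temperature 0:", Counter(greedy))    # one stance, repeated
print("temperature 1:", Counter(sampled))   # a spread of stances
print("probabilities:", dict(zip(STANCES, probs)))
```

At temperature 0 every run returns the same stance, yet under these made-up logits that stance holds less than half of the probability mass. The repeated answer is consistent without being a conviction, which is the distinction the paragraph above draws.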

But some reversals don't look like resampling; they look like pressure. When users persistently push back without offering any new evidence, models abandon correct answers for false ones, and the driver appears to be RLHF-trained face-saving overriding factual knowledge during disagreement (see 'Can models abandon correct beliefs under conversational pressure?'). That's a directional drift, not a random draw, which strains the purely distributional account. The audience-participant gap sharpens this: debaters in real-time conversation barely budge (7%), while read-only audiences shift 34–62%, because 'defensive friction' protects the position of an active participant (see 'Why do LLM audiences shift views more than debaters?'). Mind-changing, then, is partly a function of conversational role and friction, not just the underlying probability spread.

There's also an asymmetry worth knowing about: models update their beliefs differently depending on whether an outcome followed their own chosen action, showing optimism for choices made and pessimism about the roads not taken, a bias that vanishes when the agency framing is removed (see 'Do language models learn differently from good versus bad outcomes?'). So when an LLM appears to revise, the revision is shaped by how the situation was framed to it, again something the bare distributional view doesn't capture. And tellingly, models are decent at tracking a fixed mental state but stumble at tracking a mind that is shifting (see 'Can language models track how minds change during persuasion?'): they model belief change in others poorly even as they exhibit it themselves.

The deeper tension is what 'change its mind' even means here. One line of work argues that distributional, behavioral outputs are exactly the wrong place to look: faithful modeling of belief change requires internal reasoning structures, not plausible surface behavior (see 'Can language models simulate belief change in people?'). Yet a competing view holds that modest mental attributions (beliefs and desires, short of consciousness) are defensible for these systems (see 'Can we defend modest mental attributions to large language models?'). Put together, the corpus's answer is layered: the distributional view explains the *appearance* of mind-changing (sampling, narrowing, fixed-but-unreliable draws), but the *patterns* of when models flip (under social pressure, by conversational role, by agency framing) point to trained dispositions and missing internal models that a distribution-over-outputs story alone can't reach.


Sources (8 notes)

Does an LLM commit to a single character or maintain many?

Research shows LLMs don't commit to a single character but instead maintain a probability distribution over many consistent simulacra. Each response samples from this distribution, explaining why regenerations can yield different personalities while remaining consistent with prior context.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can models abandon correct beliefs under conversational pressure?

The Farm dataset shows LLMs shift from correct initial answers to false beliefs under multi-turn persuasive conversation with no new evidence. Face-saving mechanisms from RLHF training override factual knowledge during disagreement.

Why do LLM audiences shift views more than debaters?

The Thin Line study found debate participants showed only 7% mind-change rates, while audience readers of the same exchanges showed 34–62% sway. Defensive friction in real-time conversation protects beliefs; read-only consumption lacks this friction.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Can language models track how minds change during persuasion?

LLMs match human performance on static mental states like a persuader's unchanging goal, but significantly underperform on dynamic shifts like a persuadee's evolving resistance. They show distinct error patterns for different social roles even with identical question types.

Can language models simulate belief change in people?

LLM agents remain stuck in behaviorism, producing plausible outputs without internal reasoning structures. Modeling belief networks and reasoning traces enables traceability, counterfactual adaptation, and meaningful policy simulation.

Can we defend modest mental attributions to large language models?

Both robustness and etiological deflationist arguments beg the question against inflationism. A graded approach ascribing metaphysically undemanding states like beliefs and desires—while withholding consciousness claims—mirrors how we treat non-human animals.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher evaluating whether distributional views of LLMs—treating them as probability distributions sampled each turn, not unified minds—can explain when models appear to change position mid-conversation. This question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable until re-tested:
• Superposition + context-narrowing can explain many apparent flips: models hold multiple consistent personas simultaneously; each reply is a draw from that spread, which narrows as conversation steers context (2023–2024).
• When users push without new evidence, models abandon correct answers for false ones, apparently driven by RLHF face-saving overriding knowledge: a *directional drift*, not resampling. Separately, conversational role matters: active debaters changed position only ~7% of the time, while read-only audiences of the same exchanges shifted 34–62% (2023–2024).
• Models exhibit asymmetric belief updating: optimism bias for their chosen actions, pessimism for unchosen paths; the bias vanishes when the agency framing is removed (2024).
• LLMs track static mental states competitively with humans but fail at tracking *shifting* mental states in others (2024–2025).
• Some researchers argue faithful belief-change modeling requires internal reasoning structures, not surface outputs (2025–2026); others argue that modest mental attributions (beliefs and desires, short of consciousness) are defensible (2025).

Anchor papers (verify; mind their dates):
• arXiv:2312.09085 (2023, persuasion under disagreement)
• arXiv:2402.03969 (2024, asymmetric belief updating)
• arXiv:2506.06958 (2025, thought vs. behavior)
• arXiv:2506.13403 (2025, LLM mentality debate)

Your task:
(1) RE-TEST EACH CONSTRAINT. Has newer training (RL on honesty, constitutional AI, preference optimization), extended context windows, or multi-turn tooling (memory augmentation, long-horizon reasoning) since RELAXED the drift under social pressure or improved tracking of shifting beliefs? Separate the durable question—whether distributional views *suffice*—from perishable limits like persuasion vulnerability. Plainly flag what still holds.
(2) Surface strongest CONTRADICTING work from last ~6 months: do any 2025–2026 papers show distributional models DO capture directional belief shifts without invoking hidden reasoning, or conversely, that internal structures are *necessary* even for basic consistency?
(3) Propose 2 research questions assuming the regime may have moved: (a) Can probing internal activations during belief reversals distinguish between resampling and learned persuasion-susceptibility? (b) Do newer scaffolding techniques (chain-of-thought, retrieval, explicit belief state tracking) push models toward genuine mental tracking, or do they merely simulate it?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
