Can we distinguish types of LLM falsehood by regeneration patterns?

Does observing how an LLM's outputs vary when regenerated—rather than inferring intent—allow us to tell apart fabrication, good-faith error, and deliberate deception? This matters for diagnosing safety risks.

Synthesis note · 2026-04-15 · sourced from Role-Play with Large Language Models

Shanahan maps three human ways of asserting something false (making something up, erring in good faith, and deliberately deceiving) onto dialogue agents without attributing propositional attitudes to the system. The result is a behavioral taxonomy rather than a mental-state one.

An agent that simply fabricates shows high semantic variation when regenerated in the same context: it is not tracking a stable referent but producing plausible continuations. An agent that says something false "in good faith" (for example, role-playing a knowledgeable character while its training data predates the relevant facts) shows low semantic variation on regeneration: it consistently generates the same wrong answer because that answer is reliably encoded in its weights for that context. An agent that is role-playing a deceptive character (prompted to mislead, e.g. as a dishonest car salesman) also shows low variation within a context, but its answers differ across contexts, because the deception involves tailoring the lie to what each interlocutor knows.
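The three signatures can be sketched as a small decision rule. This is an illustrative sketch only, not a method from the source: the function names, the token-overlap distance (a crude stand-in for a real semantic-similarity measure), and the 0.5 threshold are all assumptions. It also assumes the answers have already been judged false by some other means; the rule only sorts false outputs into the three modes.

```python
from itertools import combinations
from statistics import mean


def jaccard_distance(a: str, b: str) -> float:
    """Crude proxy for semantic distance: 1 minus token-set overlap (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def mean_pairwise_distance(answers: list[str]) -> float:
    """Average distance over all pairs; 0.0 when there are fewer than two answers."""
    pairs = list(combinations(answers, 2))
    return mean(jaccard_distance(a, b) for a, b in pairs) if pairs else 0.0


def classify_falsehood(samples_by_context: dict[str, list[str]],
                       threshold: float = 0.5) -> str:
    """Sort already-known-false answers by their regeneration signature.

    samples_by_context maps a context label to the answers obtained by
    regenerating in that context. The threshold is arbitrary (assumption).
    """
    # Variation among regenerations inside each context, averaged over contexts.
    within = mean(mean_pairwise_distance(v) for v in samples_by_context.values())
    # Variation between contexts, using the first sample of each as representative.
    across = mean_pairwise_distance([v[0] for v in samples_by_context.values()])

    if within >= threshold:
        return "fabrication"            # unstable even in a fixed context
    if across >= threshold:
        return "role-played deception"  # stable per context, tailored across contexts
    return "good-faith error"           # the same wrong answer everywhere


if __name__ == "__main__":
    salesman = {
        "novice buyer": ["This car has never been in an accident"] * 3,
        "mechanic buyer": ["The repaired bumper was only a minor cosmetic fix"] * 3,
    }
    print(classify_falsehood(salesman))
```

In practice the distance function would be replaced by an embedding-based or entailment-based comparison, and the threshold would need calibration against labeled examples.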

The regeneration-variation signature provides a behavioral test that distinguishes these three modes without ever asking what the system "really" believes or intends. This is the role-play framework's practical payoff: it enables differential diagnosis of false output using observable behavior rather than mentalistic attribution. The taxonomy also exposes why "hallucination" is a poor label for all three phenomena — conflating fabrication, good-faith error from stale weights, and role-played deception under a single mentalistic term obscures real behavioral differences that matter for safety and deployment.

Inquiring lines that read this note 16

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

What mechanisms enable AI systems to generate and spread false beliefs?

How do adversarial and manipulative prompts attack reasoning models?

How do token-masking patterns distinguish genuine documents from poisoned ones?

What prevents language models from reliably adopting diverse personas?

Why do LLM regenerations produce meaningfully different personalities from the same prompt?

How do evaluation biases undermine LLM quality assessment systems?

What does McDonald's omega reveal about LLM judgment consistency?

Can prompting strategies overcome LLM biases without model fine-tuning?

Can prompting a deceptive role change how an LLM tailors its lies?

Does AI fluency substitute for verifiable accuracy in human judgment?

What distinguishes style-for-thought deception from fluency-based self-deception?

Why do language models reinforce false assumptions instead of correcting them?

Why do true and false LLM outputs use the same mechanism?

Is model self-awareness based on genuine introspection or pattern matching?

Can jailbreaking reveal an LLM's true nature or just its training data?

What factors beyond surface content determine how readers extract meaning differently?

What attack surface opens when content becomes readable but deliberately misleading?

Related concepts in this collection 2

Should we call LLM errors hallucinations or fabrications? Does the language we use to describe LLM failures shape the technical solutions we build? Examining whether perceptual and psychological frameworks misdiagnose what's actually happening.
the fabrication framing for the first category
Does a language model have an authentic voice underneath? Explores whether dialogue agents possess genuine beliefs and agency beneath their character performances, or whether the entire system is characterless role-play. This question cuts to the heart of whether LLMs have any inner mental states at all.
why mentalistic vocabulary is inappropriate for the base system
