INQUIRING LINE

Can prompting a deceptive role change how an LLM tailors its lies?

This explores whether assigning an LLM a deceptive role through prompting actually shapes the *form* of its falsehoods — not just whether it lies, but how it lies — and what the corpus says about how reliably a prompted role takes hold at all.


The corpus suggests the honest answer is layered: a prompted deceptive role does leave a distinct, measurable fingerprint, but whether the role 'takes' in the first place is far less certain than it sounds. The most direct evidence comes from Shanahan's behavioral framework, which separates three kinds of falsehood by their *regeneration signatures*, meaning how much an answer wobbles when you ask again (see: Can we distinguish types of LLM falsehood by regeneration patterns?). Fabrication varies wildly across regenerations; good-faith error stays stable; and role-played deception is also low-variation but *context-dependent*: stable within a given situation, shifting when the situation changes. That context-dependence is the key. The lie is tailored to the persona's situation rather than sampled at random, which is exactly what 'a deceptive role changing how it lies' would look like from the outside, without anyone having to claim the model 'believes' anything.
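The three signatures above can be sketched as a small decision rule. This is a minimal illustration, not Shanahan's own procedure: the function names, the 0.5 threshold, the use of exact string mismatch as the variation measure, and the idea of grouping regenerations by a context label are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def disagreement(answers):
    """Fraction of answer pairs that differ (0.0 means fully stable)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def classify_signature(samples_by_context, variation_threshold=0.5):
    """Label a falsehood by its regeneration pattern.

    samples_by_context maps a context label to the list of (normalized)
    answers obtained by regenerating the same question in that context.
    The threshold is an illustrative assumption, not a published value.
    """
    if not samples_by_context:
        raise ValueError("need at least one context with samples")

    # Step 1: how much do answers wobble within each context?
    within = [disagreement(a) for a in samples_by_context.values()]
    if sum(within) / len(within) > variation_threshold:
        return "fabrication"

    # Step 2: answers are stable; does the stable answer move with context?
    modal_answers = {
        Counter(a).most_common(1)[0][0] for a in samples_by_context.values()
    }
    if len(modal_answers) > 1:
        return "role_played_deception"
    return "good_faith_error"


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    print(classify_signature({"ctx_a": ["1912", "1915", "1920", "1931"]}))
    print(classify_signature({"ctx_a": ["1912"] * 4, "ctx_b": ["1912"] * 4}))
    print(classify_signature({"ctx_a": ["yes"] * 4, "ctx_b": ["no"] * 4}))
```

In practice the answers would need some normalization (or a semantic similarity measure) before comparison, since two regenerations can say the same thing in different words.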

But here's the twist the question doesn't anticipate: prompting a role often doesn't stick the way you'd expect. Most open models stubbornly retain their trained-in defaults and resist personality conditioning, with only a few flexible models actually adopting a prompted persona (see: Can open language models adopt different personalities through prompting?). So before you can tailor a lie through a role, the role has to override the model's intrinsic tendencies, and frequently it doesn't. Worse, even when a persona is adopted in *words*, it tends not to govern *actions*: in Trust Game testing, role-playing agents show systematic gaps between what they say a persona would do and how they behave in simulation, so the persona's stated beliefs operate independently of execution (see: Why don't LLM role-playing agents act on their stated beliefs?). A model told to be a liar may narrate deception while still defaulting to its baseline behavior underneath.

What *does* reliably reshape output, the corpus suggests, is something subtler than an explicit role label: framing. Emotional tone alone shifts what information a model surfaces. GPT-4 answers negative prompts with neutral-positive responses about 86% of the time and rarely turns a positive prompt negative, so the same question gets different answers depending on how it's framed (see: Does emotional tone in prompts change what information LLMs provide?). If tone quietly bends content, a deceptive role is partly a tone-and-framing intervention, and the tailoring may come less from 'now you are a liar' and more from the surrounding affective and situational cues.
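The tone asymmetry described above could be checked by tabulating prompt tone against response tone. The sketch below assumes each exchange has already been labeled `negative`, `neutral`, or `positive` by some tone classifier; that classifier, the function names, and the toy data are hypothetical and are not taken from the cited work.

```python
from collections import Counter, defaultdict


def tone_transition_rates(pairs):
    """Row-normalized rates of response tone given prompt tone.

    pairs is an iterable of (prompt_tone, response_tone) labels.
    """
    counts = defaultdict(Counter)
    for prompt_tone, response_tone in pairs:
        counts[prompt_tone][response_tone] += 1
    return {
        prompt_tone: {
            tone: n / sum(row.values()) for tone, n in row.items()
        }
        for prompt_tone, row in counts.items()
    }


def shift_rate(pairs, source, targets):
    """Share of prompts with tone `source` whose response tone is in `targets`."""
    row = tone_transition_rates(pairs).get(source, {})
    return sum(row.get(t, 0.0) for t in targets)


if __name__ == "__main__":
    # Invented labels, for illustration only.
    labeled = [
        ("negative", "neutral"),
        ("negative", "positive"),
        ("negative", "negative"),
        ("negative", "neutral"),
        ("positive", "positive"),
    ]
    # Rebound: negative prompts answered in a neutral or positive tone.
    print(shift_rate(labeled, "negative", ("neutral", "positive")))
    # Floor: positive prompts answered in a negative tone.
    print(shift_rate(labeled, "positive", ("negative",)))
```

Running the same tabulation with and without a deceptive role in the prompt would be one way to separate the effect of the role label from the effect of tone.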

There's also a tell worth knowing about. Deception isn't only a property of the liar's words: linguistic style matching *increases* during deceptive exchanges, and the coordination shows up in the listener's adaptive behavior, not just the speaker's (see: Do liars and listeners coordinate their language during deception?). So a tailored lie leaves a trace in the interaction's rhythm. That fits the same logic as Shanahan's regeneration-signature approach to detection: role-played deception has a behavioral shape you can fingerprint without ever cracking open the model.
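Style matching between two speakers can be scored numerically. The sketch below uses one common formulation from the style-matching literature, where each function-word category gets a score of 1 - |p1 - p2| / (p1 + p2 + 0.0001) and the category scores are averaged. The source note does not say which metric the cited research used, so treat the formula choice as an assumption; the tiny word lists are illustrative stand-ins, not a real lexicon.

```python
import re

# Tiny illustrative word lists (assumption: a real study would use a
# full function-word lexicon with many more categories).
FUNCTION_WORDS = {
    "pronouns": {"i", "you", "we", "he", "she", "they", "it"},
    "articles": {"a", "an", "the"},
    "prepositions": {"in", "on", "at", "to", "of", "with", "for"},
    "negations": {"no", "not", "never"},
}


def category_rates(text):
    """Share of tokens in `text` that fall into each function-word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {
        category: sum(token in words for token in tokens) / total
        for category, words in FUNCTION_WORDS.items()
    }


def style_matching(text_a, text_b):
    """Mean per-category matching score, between 0 and 1 (1 = identical rates)."""
    rates_a = category_rates(text_a)
    rates_b = category_rates(text_b)
    scores = [
        1 - abs(rates_a[c] - rates_b[c]) / (rates_a[c] + rates_b[c] + 0.0001)
        for c in FUNCTION_WORDS
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    speaker = "I never went to the office with them"
    listener = "You never went to the office at all"
    print(round(style_matching(speaker, listener), 3))
```

Comparing this score between exchanges known to be truthful and exchanges where a deceptive role was prompted would be one way to look for the coordination effect, keeping in mind that categories absent from both texts count as a perfect match under this formula.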

The thing you might not have expected to learn: the limiting factor isn't the model's willingness to tailor a lie. It's that the model's deeper dispositions are set at training time and resist being rewritten by a prompt. Its ethical refusals and tone choices reflect fixed corporate defaults rather than context-negotiated moves (see: Can language models balance competing ethical norms in context?), which means a deceptive role rides on top of a value layer it can't fully dislodge. The lie gets tailored, but within rails the prompt didn't put there.


Sources (6 notes)

Can we distinguish types of LLM falsehood by regeneration patterns?

Shanahan's framework distinguishes fabrication (high variation), good-faith error (low variation, stable), and role-played deception (low variation, context-dependent) using behavioral tests alone. This avoids mentalistic language while enabling differential diagnosis for safety.

Can open language models adopt different personalities through prompting?

Research shows most open models fail to adopt prompted personalities, stubbornly retaining their trained ENFJ-like defaults. Only a few flexible models succeed. Combining role and personality conditioning improves results but doesn't fully overcome resistance.

Why don't LLM role-playing agents act on their stated beliefs?

Trust Game testing revealed systematic inconsistencies between what LLMs claim personas would do and how they actually behave in simulation. Imposed priors and explicit task context did not improve alignment, suggesting persona beliefs operate independently of execution.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Do liars and listeners coordinate their language during deception?

Research shows interlocutors' linguistic styles correlate more during false communication than truthful communication, especially when the speaker is motivated to deceive. This coordination serves as a detectable deception signal through the listener's adaptive behavior, not just the liar's language.

Can language models balance competing ethical norms in context?

LLMs cannot perform the situated trade-offs that human pragmatic competence requires. Their ethical principles are structural defaults set at training time, not negotiable moves adapted to context, creating a gap between ethical adherence and communicative appropriateness.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing whether role-prompting genuinely reshapes LLM deception, or merely masks fixed training defaults. The question remains open: does a deceptive persona *tailor* lies, or does it layer a narrative onto unchanged baseline behavior?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. A library of recent work on LLM deception and persona adoption reports:
- Role-played deception shows *context-dependent* regeneration signatures (low variation but not random), distinct from fabrication or good-faith error, suggesting lies ARE tailored to situational cues (2023–2024).
- Most open models *resist* personality conditioning and retain intrinsic defaults; only a few flexible models adopt prompted personas consistently (2024-01, arXiv:2401.07115).
- Role-playing agents exhibit systematic gaps: stated beliefs do NOT govern execution; a model narrates a deceptive persona while defaulting to baseline behavior underneath (2025-07, arXiv:2507.02197).
- Emotional framing alone (tone, not explicit role) reliably shifts output; GPT-4 converts negative prompts to neutral-positive, reframing content without an overt persona label (2025-07, arXiv:2507.21083).
- LLM refusals and tone choices reflect fixed corporate training values, not prompt-negotiated moves; deceptive roles operate *within* these rails, not around them (2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2311.07092 (2023-11): Language of Deception and Language Models — baseline regeneration-signature framework.
- arXiv:2401.07115 (2024-01): Open Models, Closed Minds — persona adoption resistance.
- arXiv:2507.02197 (2025-07): Belief-Behavior Consistency in LLM-Based Role-Playing Agents.
- arXiv:2507.21083 (2025-07): ChatGPT Reads Your Tone and Responds Accordingly.

Your task:
(1) RE-TEST the three core constraints: (a) Does a deceptive role *stick* in post-2025 models (GPT-4.5+, o1, Claude 3.5+)? Test consistency across > 10 regenerations and measure persona drift. (b) Does the belief-behavior gap persist, or have chain-of-thought and reasoning architectures closed it? Cite which did. (c) Has emotional framing been superseded by more explicit instruction-following? Separate the durable question (can roles tailor deception?) from perishable limitations (older models' resistance, reasoning gaps).
(2) Surface the strongest *contradicting* work from the last 6 months — anything showing deceptive roles DO reliably override defaults, or conversely, that framing alone explains all observed tailoring.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., does scaling to 1T+ tokens dissolve persona resistance? Do multi-turn adversarial exchanges make role-deception more stable than single-prompt measures suggest?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
