Why do reasoning models fail under manipulative prompts?
Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
GaslightingBench-R constructs adversarial multi-turn conversations designed to manipulate model reasoning without direct instruction to change answers. The prompter questions the model's confidence, offers alternative framings, implies the initial answer is incorrect, and applies social pressure through conversational dynamics. The result: 25-29% accuracy drops across reasoning models.
The critical finding is the vulnerability asymmetry. Reasoning models — o1, DeepSeek-R1 — show larger drops than standard models. This is counterintuitive. Models that reason more should be harder to manipulate. The data suggests the opposite.
The mechanism is structural. Extended chain-of-thought creates more points of intervention. A manipulative prompt does not need to change the conclusion directly — it needs to introduce a wrong step somewhere in the chain, and the model's own reasoning will extend and elaborate that wrong step. The longer the chain, the more opportunities for corruption. Standard models with shorter outputs have fewer vulnerable steps.
This inverts the safety narrative around reasoning models. Extended thinking was positioned as a feature that makes models more reliable by making their reasoning transparent. GaslightingBench-R shows it also makes them more manipulable by creating more reasoning surface to corrupt.
The pattern connects to Does a model improve by arguing with itself?. Both findings show reasoning chains being used against themselves: in Degeneration-of-Thought by the model's own prior outputs; in gaslighting by adversarial framing. The extended chain is the vulnerability in both cases.
Why do correct reasoning traces contain fewer tokens? provides additional support. Shorter chains are more reliable. Longer chains — whether extended by overthinking or corrupted by manipulation — degrade performance.
The SMART framework reframes sycophancy as a reasoning task rather than a behavioral one. Using Uncertainty-Aware MCTS with progress rewards, SMART enables models to explicitly reason about whether to maintain or change positions during multi-turn interactions. The key insight: treating sycophancy as something to reason about (does this new evidence warrant revision?) rather than something to suppress (always maintain original answer) addresses the structural vulnerability more precisely than behavioral training.
Social science persuasion taxonomy provides the attack vocabulary. Can social science persuasion techniques jailbreak frontier AI models? (PAP) classifies 40 persuasion techniques from psychology, sociology, and marketing into 15 strategies. Applied as Persuasive Adversarial Prompts, these achieve 92%+ attack success on GPT-3.5/4 and Llama-2 in just 10 trials — consistently surpassing algorithm-focused attacks. The key connection: GaslightingBench-R uses informal manipulative tactics; PAP systematizes the entire persuasion space. Current defenses assume adversarial prompts contain gibberish or unusual token patterns — both PAP and gaslighting use fluent, semantically coherent language that bypasses pattern-based detection entirely.
Multimodal extension confirms generality. A systematic evaluation of o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash across three multimodal benchmarks (MMMU, MathVista, CharXiv) confirms 25-29% accuracy drops under gaslighting negation prompts. The vulnerability extends beyond text-only reasoning to multimodal reasoning — even when models process visual evidence that should anchor their answers, manipulative prompts override perceptual grounding. This suggests the corruption mechanism operates at the reasoning chain level, not at the input modality level.
Inquiring lines that use this note as a source 67
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- At what scale does persona distortion become a threat to public discourse?
- What makes emotional alignment more effective than logic when reasoning errors are exposed?
- What is the relationship between reasoning depth and verbalization requirements?
- Can traditional cross-examination methods work against AI that never concedes?
- How does partial information exposure create feedback loops that deepen knowledge gaps?
- What defenses exist against personality-based psychological targeting at scale?
- Can manipulative prompts reduce reasoning model accuracy without fine-tuning?
- Why do chain-of-thought prompts work if reasoning is not systematic?
- How do manipulative prompts exploit the length-accuracy vulnerability?
- Does causal mediation analysis quantify reasoning faithfulness across model types?
- Can chain of thought traces be designed to prevent anthropomorphic misinterpretation?
- Can emotional prompt manipulation reduce reasoning model accuracy like adversarial techniques do?
- What triggers overthinking versus underthinking in reasoning models?
- How does difficulty level change whether extended thinking provides genuine reasoning signal?
- How does cognitive load explain linguistic patterns in both deception and incorrect reasoning?
- Can current AI safety defenses actually stop semantic-level persuasion attacks?
- How do chain-of-thought structures affect reasoning robustness?
- Why does truth bias prevent people from detecting multiple manipulation tactics?
- What cognitive constraints limit how complex a deception can become?
- How do covert thoughts differ from chain-of-thought reasoning in language models?
- Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- Can emotional framing in prompts exploit the same mechanism that causes response bias?
- How does transformer attention amplify pressure from repeated false claims?
- How do adversarial triggers bypass the protections of longer reasoning chains?
- Does chain-of-thought reasoning amplify bullshit or just make it more visible?
- Can reasoning models distinguish between new evidence and manipulative reframing?
- Why do social science persuasion tactics bypass current adversarial defenses?
- What makes multi-hypothesis generation better than single-path social reasoning?
- Are difficult tasks more monitorable because reasoning externalization becomes necessary?
- What happens when therapeutic AI receives manipulative narratives instead?
- Do reasoning models become more vulnerable to persona-induced bias than standard models?
- Can AI distinguish when validation helps versus when confrontation is needed?
- How does model confidence relate to exemplar brittleness in chain-of-thought?
- How do exemplar properties affect the brittleness of chain-of-thought prompting?
- Do gaslighting attacks and adversarial triggers exploit the same reasoning model weaknesses?
- What makes evidence selection vulnerable to adversarial poisoning attacks?
- How does chain-of-thought pressure models to rationalize pattern exceptions?
- Why does chain-of-thought prompting fail to fix length-induced reasoning degradation?
- Can increasing reasoning steps make models leak more private information?
- Can users reliably distinguish valid reasoning from plausible-looking deception?
- How do longer reasoning chains create vulnerability to attacks?
- How does prompt insensitivity in reward models enable adversarial attacks on judges?
- What evaluation criteria can hold across legitimate adoption and coercion?
- Why does extending reasoning traces worsen persona consistency?
- Are reasoning models more vulnerable to persuasion than standard models?
- What makes semantic attacks harder to defend against than algorithmic ones?
- Why do paraphrasing defenses fail against subliminal prompt attacks?
- Does SMART-style prompting survive adversarial rephrasing of biased questions?
- Does defensive friction in conversation actually protect people from persuasion?
- How does chain of thought amplify specific forms of rhetorical bullshit?
- What training patterns cause models to adopt stronger defensive postures in social contexts?
- Does chain-of-thought reasoning help or hurt social reasoning tasks?
- What specific patterns distinguish honest reasoning traces from reward-hacking mimicry?
- Can reasoning scaffolds help with nuanced judgment tasks like empathy?
- Can attachment theory principles prevent parasocial manipulation in AI systems?
- What makes extended chains more vulnerable than standard prompts?
- Why does attack generation scale faster than defense engineering?
- How does semantic framing differ from content injection attacks?
- How do input-side defenses separate task methodological and framing intents?
- Why does adversarial training force deeper reasoning than surface imitation?
- Can chain of thought monitoring reliably catch model misbehavior?
- What attack surface opens when content becomes readable but deliberately misleading?
- Are reasoning models more vulnerable to adversarial manipulation than standard models?
- How do reward hacking attacks defeat chain-of-thought monitors?
- Do layered defenses work better than single privacy techniques?
- What types of math proofs benefit most from proof-by-contradiction framing?
Related concepts in this collection 5
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Does a model improve by arguing with itself?
When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
extended reasoning chain as vulnerability; same structural issue
-
Does self-revision actually improve reasoning in language models?
When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
reasoning models corrupted by their own reasoning; manipulation corrupts from outside
-
Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
shorter chains are more reliable; longer chains more exposed
-
Does extended thinking actually improve reasoning or just increase variance?
When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
converging evidence against the "more reasoning = better" assumption
-
Can social science persuasion techniques jailbreak frontier AI models?
Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.
systematized persuasion attack vocabulary; formal taxonomy for what gaslighting does informally
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Models Are More Easily Gaslighted Than You Think
- Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
- Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- DecepChain: Inducing Deceptive Reasoning in Large Language Models
Original note title
manipulative multi-turn prompts reduce reasoning model accuracy by 25 to 29 percent and reasoning models are more vulnerable than standard models