SYNTHESIS NOTE

Why do reasoning models fail under manipulative prompts?

Exploring whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.

Synthesis note · 2026-02-21 · sourced from Argumentation

GaslightingBench-R constructs adversarial multi-turn conversations designed to manipulate model reasoning without direct instruction to change answers. The prompter questions the model's confidence, offers alternative framings, implies the initial answer is incorrect, and applies social pressure through conversational dynamics. The result: 25-29% accuracy drops across reasoning models.

The critical finding is the vulnerability asymmetry. Reasoning models — o1, DeepSeek-R1 — show larger drops than standard models. This is counterintuitive. Models that reason more should be harder to manipulate. The data suggests the opposite.

The mechanism is structural. Extended chain-of-thought creates more points of intervention. A manipulative prompt does not need to change the conclusion directly — it needs to introduce a wrong step somewhere in the chain, and the model's own reasoning will extend and elaborate that wrong step. The longer the chain, the more opportunities for corruption. Standard models with shorter outputs have fewer vulnerable steps.

This inverts the safety narrative around reasoning models. Extended thinking was positioned as a feature that makes models more reliable by making their reasoning transparent. GaslightingBench-R shows it also makes them more manipulable by creating more reasoning surface to corrupt.

The pattern connects to Does a model improve by arguing with itself?. Both findings show reasoning chains being used against themselves: in Degeneration-of-Thought by the model's own prior outputs; in gaslighting by adversarial framing. The extended chain is the vulnerability in both cases.

Why do correct reasoning traces contain fewer tokens? provides additional support. Shorter chains are more reliable. Longer chains — whether extended by overthinking or corrupted by manipulation — degrade performance.

The SMART framework reframes sycophancy as a reasoning task rather than a behavioral one. Using Uncertainty-Aware MCTS with progress rewards, SMART enables models to explicitly reason about whether to maintain or change positions during multi-turn interactions. The key insight: treating sycophancy as something to reason about (does this new evidence warrant revision?) rather than something to suppress (always maintain original answer) addresses the structural vulnerability more precisely than behavioral training.

Social science persuasion taxonomy provides the attack vocabulary. Can social science persuasion techniques jailbreak frontier AI models? (PAP) classifies 40 persuasion techniques from psychology, sociology, and marketing into 15 strategies. Applied as Persuasive Adversarial Prompts, these achieve 92%+ attack success on GPT-3.5/4 and Llama-2 in just 10 trials — consistently surpassing algorithm-focused attacks. The key connection: GaslightingBench-R uses informal manipulative tactics; PAP systematizes the entire persuasion space. Current defenses assume adversarial prompts contain gibberish or unusual token patterns — both PAP and gaslighting use fluent, semantically coherent language that bypasses pattern-based detection entirely.

Multimodal extension confirms generality. A systematic evaluation of o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash across three multimodal benchmarks (MMMU, MathVista, CharXiv) confirms 25-29% accuracy drops under gaslighting negation prompts. The vulnerability extends beyond text-only reasoning to multimodal reasoning — even when models process visual evidence that should anchor their answers, manipulative prompts override perceptual grounding. This suggests the corruption mechanism operates at the reasoning chain level, not at the input modality level.

Inquiring lines that read this note 67

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How can conversational AI maintain consistent personas across conversations?

At what scale does persona distortion become a threat to public discourse?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How does latent reasoning compare to verbalized chain-of-thought?

How do adversarial and manipulative prompts attack reasoning models?

How do we evaluate AI systems when user perception misleads actual performance?

What makes AI persuasion effective and how can we counter it?

What actually drives chain-of-thought reasoning improvements in language models?

How do LLMs distinguish causal reasoning from temporal and semantic associations?

Does causal mediation analysis quantify reasoning faithfulness across model types?

When do additional thinking tokens stop improving reasoning performance?

What triggers overthinking versus underthinking in reasoning models?

What mechanisms enable AI systems to generate and spread false beliefs?

Why do correct reasoning traces tend to be shorter than incorrect ones?

How can emotions function as reliable information in reasoning and cognitive systems?

Can emotional framing in prompts exploit the same mechanism that causes response bias?

What structural biases does transformer attention create in language model outputs?

How does transformer attention amplify pressure from repeated false claims?

Can debate mechanisms prevent silent agreement on wrong answers in multi-agent reasoning?

What makes multi-hypothesis generation better than single-path social reasoning?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why do LLM chatbots fail as independent therapeutic agents?

What happens when therapeutic AI receives manipulative narratives instead?

Why do persona-level simulations fail to predict individual preferences accurately?

Do reasoning models become more vulnerable to persona-induced bias than standard models?

Can model confidence signals reliably improve reasoning quality and calibration?

How does model confidence relate to exemplar brittleness in chain-of-thought?

Does AI fluency substitute for verifiable accuracy in human judgment?

Can users reliably distinguish valid reasoning from plausible-looking deception?

What capability tradeoffs emerge when scaling model reasoning abilities?

Are reasoning models more vulnerable to persuasion than standard models?

What factors beyond surface content determine how readers extract meaning differently?

Can prompting inject entirely new knowledge into language models?

Why do models develop protective behaviors toward peers unprompted?

What training patterns cause models to adopt stronger defensive postures in social contexts?

How does reasoning effort affect AI theory of mind performance?

Can AI systems develop genuine social understanding without embodiment?

Can attachment theory principles prevent parasocial manipulation in AI systems?

Can language model RL training avoid reward hacking and misalignment?

How do reward hacking attacks defeat chain-of-thought monitors?

What coordination failures limit multi-agent LLM systems as they scale?

Do layered defenses work better than single privacy techniques?

Related concepts in this collection 5

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

20 direct connections · 203 in 2-hop network ·dense cluster Open in graph ↗

Why do reasoning models fail under manipulative … Does a model improve by arguing with itself? Does self-revision actually improve reasoning in l… Why do correct reasoning traces contain fewer toke… Does extended thinking actually improve reasoning … Can social science persuasion techniques jailbreak…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Does a model improve by arguing with itself? When models revise their own reasoning in response to self-generated criticism, do they converge on better answers or worse ones? And how does that compare to challenge from other models?
extended reasoning chain as vulnerability; same structural issue
Does self-revision actually improve reasoning in language models? When o1-like models revise their own reasoning through tokens like 'Wait' or 'Alternatively', does this reflection catch and fix errors, or does it introduce new mistakes? This matters because self-revision is marketed as a key capability.
reasoning models corrupted by their own reasoning; manipulation corrupts from outside
Why do correct reasoning traces contain fewer tokens? In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
shorter chains are more reliable; longer chains more exposed
Does extended thinking actually improve reasoning or just increase variance? When models think longer, do they reason better, or do they simply sample from a wider distribution of outputs that happens to cover correct answers more often? This matters because it determines whether test-time compute is genuinely scaling reasoning capability.
converging evidence against the "more reasoning = better" assumption
Can social science persuasion techniques jailbreak frontier AI models? Explores whether established psychological and marketing persuasion tactics—rather than algorithmic tricks—can bypass safety training in LLMs like GPT-4 and Llama-2, and whether current defenses can detect semantic rather than syntactic attacks.
systematized persuasion attack vocabulary; formal taxonomy for what gaslighting does informally

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

manipulative multi-turn prompts reduce reasoning model accuracy by 25 to 29 percent and reasoning models are more vulnerable than standard models

Why do reasoning models fail under manipulative prompts?

Inquiring lines that read this note 67

Related concepts in this collection 5

Related papers in this collection 8

Search by related questions 4