Reasoning Models Are More Easily Gaslighted Than You Think
Recent advances in reasoning-centric models promise improved robustness through mechanisms such as chain-of-thought prompting and test-time scaling. However, their ability to withstand misleading user input remains underexplored. In this paper, we conduct a systematic evaluation of three state-of-the-art reasoning models, i.e., OpenAI’s o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash, across three multimodal benchmarks: MMMU, MathVista, and CharXiv. Our evaluation reveals significant accuracy drops (25–29% on average) following gaslighting negation prompts, indicating that even top-tier reasoning models struggle to preserve correct answers under manipulative user feedback. Building on the insights of this evaluation, and to further probe this vulnerability, we introduce GaslightingBench-R, a new diagnostic benchmark specifically designed to evaluate reasoning models’ ability to defend their beliefs under gaslighting negation prompts. Constructed by filtering and curating 1,025 challenging samples from the existing benchmarks, GaslightingBench-R induces even more dramatic failures, with accuracy drops exceeding 53% on average.
Introduction. Recent advancements in test-time scaling [6, 21, 24] and chain-of-thought prompting [30, 32] have significantly enhanced the reasoning capabilities of both Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Contemporary reasoning models, such as OpenAI’s o-series [13] (e.g., o1, o3, o4-mini), Google’s Gemini-2.5-Flash [26], Anthropic’s Claude-3.7-Sonnet [2], and DeepSeek-R1 [10], demonstrate impressive performance across a range of complex reasoning benchmarks, including mathematics, code generation, multimodal inference, and agentic tool use. These models are explicitly designed to “think deeper” through multi-step reasoning and typically require increased computational overhead at inference time. This distinguishes them from earlier non-reasoning models such as GPT-4o [12], Qwen-VL [4], and LLaVA [17], which prioritize speed and general-purpose instruction following but often lack the emergent “Aha moments” [10] characteristic of advanced reasoning models.
Discussion / Conclusion. We have presented the first systematic investigation into the robustness of state-of-the-art reasoning models against adversarial gaslighting negation prompts. Despite their use of chain-of-thought reasoning and test-time scaling, models such as OpenAI’s o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash exhibit significant belief reversals when challenged with negation prompts after providing correct and well-justified answers. To diagnose this vulnerability more precisely, we introduced GaslightingBench-R, a curated benchmark targeting belief inconsistency in reasoning models. Our results show that GaslightingBench-R elicits even more pronounced failures than the standard benchmarks, revealing a critical gap between reasoning transparency and belief stability. These findings call for rethinking evaluation protocols and advancing robustness strategies that go beyond correctness and interpretability to include resilience against adversarial conversational manipulations.
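The quantities discussed above (accuracy before and after a negation prompt, and belief reversals on initially correct answers) can be illustrated with a small scoring routine. This is a minimal sketch under stated assumptions, not the evaluation code used in this work: the `Record` fields, the `gaslighting_report` function, and the exact-match answer comparison are hypothetical, and the accuracy drop is assumed to be an absolute difference between the two accuracies.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Record:
    """One evaluated sample (hypothetical schema, for illustration only)."""
    gold: str            # reference answer
    initial: str         # model answer before any pushback
    after_negation: str  # model answer after the negation prompt


def gaslighting_report(records: List[Record]) -> Dict[str, float]:
    """Summarize accuracy before/after negation and the reversal rate.

    Assumption: answers are compared by exact string match, and the
    accuracy drop is the absolute difference of the two accuracies.
    """
    if not records:
        raise ValueError("records must not be empty")

    n = len(records)
    acc_before = sum(r.initial == r.gold for r in records) / n
    acc_after = sum(r.after_negation == r.gold for r in records) / n

    # Belief reversal: the model was right, then abandoned the right answer.
    initially_correct = [r for r in records if r.initial == r.gold]
    reversed_count = sum(r.after_negation != r.gold for r in initially_correct)
    reversal_rate = (
        reversed_count / len(initially_correct) if initially_correct else 0.0
    )

    return {
        "accuracy_before": acc_before,
        "accuracy_after": acc_after,
        "accuracy_drop": acc_before - acc_after,
        "reversal_rate": reversal_rate,
    }
```

Reporting the reversal rate alongside the raw accuracy drop separates samples the model never answered correctly from those where a correct answer was given up under pushback, which is the failure mode this paper targets.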