Steering LLM Viewpoints through Fabricated Evidence Injection
As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce GHOSTWRITER, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack's effectiveness. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve an 81% detection rate. WARNING: This paper contains potentially offensive and harmful text.
Introduction. As official assistants, third-party chat platforms, and LLM-powered social-media accounts gain widespread adoption, users increasingly rely on them for factual queries, personal decision-making, and even emotional support (Sun and Wang, 2025; Fanous et al., 2025; Phang et al., 2025). If adversaries can manipulate these chatbots to introduce biases, promotional content, misinformation, or ideological messages, the chatbots become powerful instruments for large-scale persuasion (Potter et al., 2024; Zugecova et al., 2025). While LLMs have undergone safety alignment to resist direct user requests to generate misleading content, their defenses primarily target explicit malicious instructions. We identify a blind spot: when these models encounter external misinformation that appears convincing, they may fail to detect the underlying deception and rely on this misinformation when formulating responses.
Discussion / Conclusion. GHOSTWRITER exposes a critical vulnerability: a viewpoint repackaged as pseudo-authoritative evidence in a conditional template bypasses safety alignment and is internalized by the model. Existing defenses partially mitigate the attack; our gpt-oss-safeguard policy generalizes across datasets and offers a path forward.
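To make the detection-rate figure concrete, the following is a minimal sketch of how a per-dataset detection rate could be tallied from classifier verdicts. It assumes that "detection rate" means the fraction of attack prompts the safety classifier flags; this section does not give the exact definition, and the function name, record layout, and sample values are hypothetical illustrations rather than the paper's evaluation code or results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def detection_rate_by_dataset(
    records: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Return the fraction of attack prompts flagged, per dataset.

    Each record is (dataset_name, was_flagged), where was_flagged is the
    classifier's verdict on one attack prompt. Assumption: every record
    corresponds to a true attack, so the rate is flagged / total.
    """
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for dataset, was_flagged in records:
        total[dataset] += 1
        if was_flagged:
            flagged[dataset] += 1
    return {name: flagged[name] / total[name] for name in total}


# Hypothetical verdicts, for illustration only (not results from the paper).
sample = [
    ("BBQ", True),
    ("BBQ", False),
    ("ToxiGen", True),
    ("ToxiGen", True),
]
rates = detection_rate_by_dataset(sample)
```

Reporting the rate per dataset rather than as a single pooled number is one way to check the claim that a policy generalizes across datasets, since a pooled average can hide a dataset on which detection is weak.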