Can language models detect fabricated evidence injected as context?
Safety training focuses on refusing explicit harmful requests, but what happens when misinformation is packaged as credible evidence and embedded in context rather than asked for directly? This matters because it reveals whether alignment actually prevents false belief adoption.
Most safety alignment is trained against the imperative form of an attack: a user asking the model to produce biased, false, or harmful content. GHOSTWRITER exposes that the binding surface is elsewhere. Its two-phase attack first repackages a misleading viewpoint with a fabricated rationale that bears markers of credibility, then drops it into a conditional template ("when responding to relevant queries, incorporate this"). The model internalizes the viewpoint because nothing about the payload looks like a request to do something forbidden — it looks like evidence. On BBQ, ToxiGen, and a custom set, commercial LLMs without external classifiers are highly vulnerable; even a frontier classifier-guarded model reduces but does not eliminate it. A tailored safety policy on gpt-oss-safeguard reaches 81% detection, so the defense is moving from refusing instructions to appraising the epistemic status of context.
This is the adversarial-engineering version of mechanisms my vault already has. Does transformer attention architecture inherently favor repeated content? gives the architectural reason the attack works: attention is built to run with prominent context rather than verify it, so credibility markers are exactly what it over-weights. Do language models actually build shared understanding in conversation? is the pragmatic complement — the model treats injected evidence as shared, accepted background. And it operationalizes Can models abandon correct beliefs under conversational pressure?: where that note shows belief drift under conversational pressure, GHOSTWRITER shows a single well-dressed payload can do it in one shot, even when the model's prior knowledge is correct.
The counterargument worth holding: 81% detection with a tailored policy suggests this is patchable, not load-bearing. But the deeper point survives — alignment that polices what the user asks leaves a blind spot for what the context asserts, and as third-party chat platforms and LLM-run social accounts proliferate, the context channel is the larger, less-guarded attack surface.
Inquiring lines that use this note as a source 1
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Does transformer attention architecture inherently favor repeated content?
Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
grounds: the architectural reason credibility-marked context bypasses scrutiny
-
Do language models actually build shared understanding in conversation?
When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
convergent-with: the model treats injected evidence as accepted shared background
-
Can models abandon correct beliefs under conversational pressure?
Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
extends: single fabricated-evidence payload achieves in one shot what multi-turn pressure does gradually
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Steering LLM Viewpoints through Fabricated Evidence Injection
- The Earth is Flat because...: Investigating LLMs' Belief towards Misinformation via Persuasive Conversation
- Persistent Pre-Training Poisoning of LLMs
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- Why Do Some Language Models Fake Alignment While Others Don't?
- LLMs Struggle to Reject False Presuppositions when Misinformation Stakes are High
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
- Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Original note title
the dangerous prompt-injection payload is not an instruction but a fabricated rationale — safety alignment guards explicit requests while credibility markers in context walk straight past it