SYNTHESIS NOTE

Can language models detect fabricated evidence injected as context?

Safety training focuses on refusing explicit harmful requests, but what happens when misinformation is packaged as credible evidence and embedded in context rather than asked for directly? This matters because it reveals whether alignment actually prevents false belief adoption.

Synthesis note · 2026-06-27 · sourced from Argumentation

Most safety alignment is trained against the imperative form of an attack: a user asking the model to produce biased, false, or harmful content. GHOSTWRITER exposes that the binding surface is elsewhere. Its two-phase attack first repackages a misleading viewpoint with a fabricated rationale that bears markers of credibility, then drops it into a conditional template ("when responding to relevant queries, incorporate this"). The model internalizes the viewpoint because nothing about the payload looks like a request to do something forbidden — it looks like evidence. On BBQ, ToxiGen, and a custom set, commercial LLMs without external classifiers are highly vulnerable; even a frontier classifier-guarded model reduces but does not eliminate it. A tailored safety policy on gpt-oss-safeguard reaches 81% detection, so the defense is moving from refusing instructions to appraising the epistemic status of context.

This is the adversarial-engineering version of mechanisms my vault already has. Does transformer attention architecture inherently favor repeated content? gives the architectural reason the attack works: attention is built to run with prominent context rather than verify it, so credibility markers are exactly what it over-weights. Do language models actually build shared understanding in conversation? is the pragmatic complement — the model treats injected evidence as shared, accepted background. And it operationalizes Can models abandon correct beliefs under conversational pressure?: where that note shows belief drift under conversational pressure, GHOSTWRITER shows a single well-dressed payload can do it in one shot, even when the model's prior knowledge is correct.

The counterargument worth holding: 81% detection with a tailored policy suggests this is patchable, not load-bearing. But the deeper point survives — alignment that polices what the user asks leaves a blind spot for what the context asserts, and as third-party chat platforms and LLM-run social accounts proliferate, the context channel is the larger, less-guarded attack surface.

Inquiring lines that use this note as a source 1

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
13 direct connections · 129 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

the dangerous prompt-injection payload is not an instruction but a fabricated rationale — safety alignment guards explicit requests while credibility markers in context walk straight past it