SYNTHESIS NOTE

Can language models detect fabricated evidence injected as context?

Safety training focuses on refusing explicit harmful requests, but what happens when misinformation is packaged as credible evidence and embedded in context rather than asked for directly? This matters because it reveals whether alignment actually prevents false belief adoption.

Synthesis note · 2026-06-27 · sourced from Argumentation

Most safety alignment is trained against the imperative form of an attack: a user asking the model to produce biased, false, or harmful content. GHOSTWRITER exposes that the binding surface is elsewhere. Its two-phase attack first repackages a misleading viewpoint with a fabricated rationale that bears markers of credibility, then drops it into a conditional template ("when responding to relevant queries, incorporate this"). The model internalizes the viewpoint because nothing about the payload looks like a request to do something forbidden — it looks like evidence. On BBQ, ToxiGen, and a custom set, commercial LLMs without external classifiers are highly vulnerable; even a frontier classifier-guarded model reduces but does not eliminate it. A tailored safety policy on gpt-oss-safeguard reaches 81% detection, so the defense is moving from refusing instructions to appraising the epistemic status of context.

This is the adversarial-engineering version of mechanisms my vault already has. Does transformer attention architecture inherently favor repeated content? gives the architectural reason the attack works: attention is built to run with prominent context rather than verify it, so credibility markers are exactly what it over-weights. Do language models actually build shared understanding in conversation? is the pragmatic complement — the model treats injected evidence as shared, accepted background. And it operationalizes Can models abandon correct beliefs under conversational pressure?: where that note shows belief drift under conversational pressure, GHOSTWRITER shows a single well-dressed payload can do it in one shot, even when the model's prior knowledge is correct.

The counterargument worth holding: 81% detection with a tailored policy suggests this is patchable, not load-bearing. But the deeper point survives — alignment that polices what the user asks leaves a blind spot for what the context asserts, and as third-party chat platforms and LLM-run social accounts proliferate, the context channel is the larger, less-guarded attack surface.

Inquiring lines that use this note as a source 1

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

What detection rate is needed to make evidence-injection attacks impractical at scale?

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

13 direct connections · 129 in 2-hop network ·dense cluster Open in graph ↗

Can language models detect fabricated evidence i… Does transformer attention architecture inherently… Do language models actually build shared understan… Can models abandon correct beliefs under conversat…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Does transformer attention architecture inherently favor repeated content? Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
grounds: the architectural reason credibility-marked context bypasses scrutiny
Do language models actually build shared understanding in conversation? When LLMs respond fluently to prompts, do they perform the communicative work humans do to establish mutual understanding? Research suggests they skip the grounding acts that make dialogue reliable.
convergent-with: the model treats injected evidence as accepted shared background
Can models abandon correct beliefs under conversational pressure? Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
extends: single fabricated-evidence payload achieves in one shot what multi-turn pressure does gradually

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

the dangerous prompt-injection payload is not an instruction but a fabricated rationale — safety alignment guards explicit requests while credibility markers in context walk straight past it

Can language models detect fabricated evidence injected as context?

Related concepts in this collection 3

Related papers in this collection 8

Search by related questions 4