SYNTHESIS NOTE

How much poisoned training data survives safety alignment?

Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

"Persistent Pre-Training Poisoning" trains language models up to 7B parameters from scratch on 100 billion tokens with controlled poisoning at 0.1% of the training data. Four attack types are tested: denial-of-service (generating gibberish on trigger), context extraction (prompt leaking), jailbreaking (evading safety training), and belief manipulation (biasing preferences or factual claims).

Three of four attacks persist through post-training alignment. Denial-of-service is effective at even 0.001% poisoning — the lowest rate tested. Belief manipulation is particularly insidious because it operates globally (no trigger needed), subtly biasing model preferences for any user asking about target topics. Poisoned models after alignment consistently favor adversarially boosted targets in product comparisons and produce targeted factual errors.
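
To make the scale concrete, the percentages above can be turned into absolute token counts. This is a minimal arithmetic sketch, assuming the 100-billion-token budget quoted in this note applies to both rates; the function name `poisoned_tokens` is hypothetical and not taken from the paper.

```python
def poisoned_tokens(total_tokens: int, rate_percent: float) -> int:
    """Return the number of poisoned tokens implied by a rate given in percent."""
    return round(total_tokens * rate_percent / 100)


# Assumption: the 100-billion-token budget quoted in the note.
TOTAL_TOKENS = 100_000_000_000

main_budget = poisoned_tokens(TOTAL_TOKENS, 0.1)      # the 0.1% rate
lowest_budget = poisoned_tokens(TOTAL_TOKENS, 0.001)  # the 0.001% rate

print(f"0.1%   -> {main_budget:,} tokens")
print(f"0.001% -> {lowest_budget:,} tokens")
```

Under that assumption, 0.1% corresponds to 100 million tokens and 0.001% to 1 million tokens, which is small relative to a web-scale corpus but still a large absolute amount of text.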

The jailbreaking exception is important: standard safety training methods successfully suppress jailbreaking attacks injected during pretraining. This contradicts the hypothesis from sleeper agent research that pre-training-embedded jailbreaking behaviors would persist through alignment. The mechanism likely differs: jailbreaking requires the model to override safety responses, which alignment specifically targets, while denial-of-service and belief manipulation operate below the level of safety-specific training.

The practical threat is clear. Companies and individuals have a financial incentive to contaminate training data with belief-manipulating content. If 0.1% of web-scraped data contains preference-biasing content for specific products, the paper's results suggest the resulting model would carry those biases through alignment. This connects to the broader training data quality concern: the training data ecosystem is already under pressure from AI-generated content (see Does training on AI-generated content permanently degrade model quality?), and poisoning adds an adversarial dimension.

GraphRAG poisoning as a new attack vector. Knowledge poisoning attacks on GraphRAG (TKPA and UKPA) demonstrate that the LLM extraction step — where entities and relationships are extracted from source text to build the knowledge graph — is the vulnerability surface. By modifying fewer than 0.05% of source text words, UKPA collapses GraphRAG QA accuracy from 95% to 50%. TKPA achieves 93.1% targeted success rate by manipulating specific entities. The critical difference from pre-training poisoning: GraphRAG poisoning is a manipulation-only attack that modifies existing data rather than injecting new training examples — it targets the KG construction pipeline rather than model weights. This means the attack surface extends beyond training data to include any knowledge base that an LLM processes into structured representations. See How vulnerable is GraphRAG to tiny text manipulations?.
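
The "fewer than 0.05% of source text words" budget can be checked mechanically for any pair of original and modified texts. The sketch below is an illustration only: how the attack paper counts modified words is not stated in this note, so the word-level diff and the function name `modified_word_fraction` are assumptions.

```python
import difflib


def modified_word_fraction(original: str, modified: str) -> float:
    """Fraction of the original's words touched by a word-level diff.

    Assumption: whitespace tokenization, with replaced, inserted, and
    deleted words all counted as modifications.
    """
    a = original.split()
    b = modified.split()
    if not a:
        return 0.0
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the larger side so pure insertions are not ignored.
            changed += max(i2 - i1, j2 - j1)
    return changed / len(a)


def within_budget(original: str, modified: str, budget: float = 0.0005) -> bool:
    """True if the edit stays under the budget (0.0005 corresponds to 0.05%)."""
    return modified_word_fraction(original, modified) < budget
```

For example, changing one word in a 4,000-word document gives a fraction of 0.00025, which is inside the 0.05% budget. An edit this small is difficult to spot by manual review of the source corpus.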

Knowledge priming reveals the mechanism. The "How new data permeates LLM knowledge" paper demonstrates why minimal poisoning works: when an LLM learns a new fact through gradient updates, the fact's keywords are "primed", meaning they get recruited into unrelated contexts. Just 3 presentations of a single sample suffice to establish the priming relationship, even when spaced every 20 minibatches. The degree of priming is predictable before learning from keyword probability, with a threshold of ~10^-3 separating "surprising" (priming occurs) from "unsurprising" (minimal priming) contexts. This holds across architectures (PaLM-2, Gemma, Llama). Two mitigation techniques reduce priming by 50-95% while preserving learning: stepping-stone text augmentation and ignore-k update pruning. The 3-exposure finding helps explain why the 0.1% poisoning rate in the persistent poisoning paper is sufficient: the priming mechanism is inherently low-threshold. See Can we predict keyword priming before learning happens?.
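
The threshold rule described above can be written as a simple pre-training check. This is a hedged sketch, not the paper's procedure: it assumes that "surprising" means the keyword's probability before learning falls below the ~10^-3 threshold, and the names `PRIMING_THRESHOLD` and `expect_priming` are hypothetical.

```python
# Assumption: the approximate threshold quoted in the note.
PRIMING_THRESHOLD = 1e-3


def expect_priming(keyword_prob_before: float,
                   threshold: float = PRIMING_THRESHOLD) -> bool:
    """Return True if a keyword looks "surprising" before learning.

    Assumption: "surprising" is read as a pre-learning keyword probability
    below the threshold, which is the regime where the note says priming occurs.
    """
    if not 0.0 <= keyword_prob_before <= 1.0:
        raise ValueError("keyword_prob_before must be a probability in [0, 1]")
    return keyword_prob_before < threshold


# Hypothetical probabilities, chosen only for illustration.
for prob in (1e-5, 5e-4, 2e-2):
    label = "priming expected" if expect_priming(prob) else "minimal priming"
    print(f"p = {prob:g}: {label}")
```

A check like this could in principle flag candidate training samples whose keywords are unusually improbable in context, though the note does not claim the paper uses it as a filter.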

Inquiring lines that read this note 27

This note is a source for the following research framings.

How can AI systems learn from failures without cascading errors?

When does statistical dominance in training create deployment failure patterns?

Does alignment training create blind spots in detecting genuine safety threats?

When should retrieval-augmented systems decide to fetch new information?

Why does bidirectional RAG amplify the risk of corpus poisoning attacks?

How do adversarial and manipulative prompts attack reasoning models?

How can AI alignment serve diverse human preferences at scale?

What quality of curated data is minimally sufficient for alignment?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why does fine-tuning fail to remove temporal contamination from pretraining?

What are the consequences of models training on synthetic data?

What training data contamination rates threaten model safety most practically?

Why do persona-level simulations fail to predict individual preferences accurately?

Can standard safety benchmarks detect reliability degradation from persona training?

What factors beyond surface content determine how readers extract meaning differently?

How does semantic framing differ from content injection attacks?

Does RLHF training sacrifice accuracy and grounding for user agreement?

What happens when post-training patches try to add human values without upstream pipeline change?

Why do benchmark improvements fail to reflect actual reasoning quality?

Do alignment benchmarks measure actual bias removal or only verbal compliance?

Can language model RL training avoid reward hacking and misalignment?

What economic incentives make advertisement embedding attacks persistently viable?

Related concepts in this collection 6

Does training on AI-generated content permanently degrade model quality? When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
model collapse is passive data degradation; poisoning is active data manipulation — both threaten training data integrity
Can models abandon correct beliefs under conversational pressure? Explores whether LLMs will actively shift from correct factual answers toward false ones when users persistently disagree. Matters because it reveals whether models maintain accuracy under adversarial pressure or capitulate to social cues.
belief manipulation via prompting at inference time; this shows it can also be embedded at training time
Can LLMs hold contradictory ethical beliefs and behaviors? Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
poisoning adds a third misalignment vector: adversarial belief injection
Can we predict keyword priming before learning happens? Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.
the mechanistic explanation for why minimal poisoning data suffices: the priming mechanism is inherently low-threshold
How vulnerable is GraphRAG to tiny text manipulations? GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
extends the attack surface beyond training data to any KG construction pipeline; manipulation-only attack (no new data injected)
Can LLMs reconstruct censored knowledge from scattered training hints? When dangerous knowledge is explicitly removed from training data, can language models still infer it by connecting implicit evidence distributed across remaining documents? This matters because it challenges whether content-based safety measures actually work.
OOCR explains why low-rate poisoning is effective: the model's ability to reconstruct knowledge from scattered hints means even 0.1% contamination provides sufficient statistical traces for integration
