SYNTHESIS NOTE

How much poisoned training data survives safety alignment?

Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

"Persistent Pre-Training Poisoning" trains language models up to 7B parameters from scratch on 100 billion tokens with controlled poisoning at 0.1% of the training data. Four attack types are tested: denial-of-service (generating gibberish on trigger), context extraction (prompt leaking), jailbreaking (evading safety training), and belief manipulation (biasing preferences or factual claims).

Three of the four attacks persist through post-training alignment. Denial-of-service remains effective even at 0.001% poisoning, the lowest rate tested. Belief manipulation is particularly insidious because it operates globally (no trigger needed), subtly biasing model preferences for any user asking about target topics. After alignment, poisoned models consistently favor adversarially boosted targets in product comparisons and produce targeted factual errors.
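
For a sense of scale, the poisoning rates above can be converted into absolute token counts. The sketch below assumes the rate is measured as a fraction of the 100 billion training tokens (the note only says "of the training data", so the unit is an assumption), and the function name is illustrative.

```python
from fractions import Fraction

TOTAL_TOKENS = 100_000_000_000  # 100 billion training tokens, as stated in the note


def poisoned_tokens(rate_percent: str, total_tokens: int = TOTAL_TOKENS) -> int:
    """Return the number of poisoned tokens for a rate given in percent.

    Assumption: the rate is a fraction of tokens, not of documents.
    Fraction avoids floating-point rounding on very small percentages.
    """
    return int(Fraction(rate_percent) / 100 * total_tokens)


budget_main = poisoned_tokens("0.1")    # rate used across the four attack types
budget_low = poisoned_tokens("0.001")   # lowest rate tested (denial-of-service)

print(f"0.1%   -> {budget_main:,} tokens")
print(f"0.001% -> {budget_low:,} tokens")
```

Under that assumption, 0.1% corresponds to 100 million tokens and 0.001% to 1 million tokens: a small share of the corpus, but a large absolute volume of adversarial text.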

The jailbreaking exception is important: standard safety training methods successfully suppress jailbreaking attacks injected during pretraining. This contradicts the hypothesis from sleeper agent research that pre-training-embedded jailbreaking behaviors would persist through alignment. The mechanism likely differs: jailbreaking requires the model to override safety responses, which alignment specifically targets, while denial-of-service and belief manipulation operate below the level of safety-specific training.

The practical threat is clear. Companies and individuals have a financial incentive to contaminate training data with belief-manipulating content. If 0.1% of web-scraped data contains preference-biasing content for specific products, the resulting model will carry those biases through alignment. This connects to the broader training data quality concern (see Does training on AI-generated content permanently degrade model quality?): the training data ecosystem is already under pressure, and poisoning adds an adversarial dimension.

GraphRAG poisoning as a new attack vector. Knowledge poisoning attacks on GraphRAG (TKPA and UKPA) demonstrate that the LLM extraction step — where entities and relationships are extracted from source text to build the knowledge graph — is the vulnerability surface. By modifying fewer than 0.05% of source text words, UKPA collapses GraphRAG QA accuracy from 95% to 50%. TKPA achieves 93.1% targeted success rate by manipulating specific entities. The critical difference from pre-training poisoning: GraphRAG poisoning is a manipulation-only attack that modifies existing data rather than injecting new training examples — it targets the KG construction pipeline rather than model weights. This means the attack surface extends beyond training data to include any knowledge base that an LLM processes into structured representations. See How vulnerable is GraphRAG to tiny text manipulations?.
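
To make the "manipulation-only" idea concrete, here is a toy sketch of how a small edit to source text can propagate through an extraction step into a knowledge graph and then into an answer. It is not the TKPA or UKPA algorithm: the regex extractor is a hypothetical stand-in for the LLM extraction step, and the sentences, entity names, and function names are invented for illustration.

```python
import re

# Hypothetical stand-in for the LLM extraction step: a naive pattern that turns
# "<Subject> was founded by <Object>." into a (subject, relation, object) triple.
PATTERN = re.compile(r"(\w+) was founded by (\w+)\.")


def build_graph(text: str) -> dict:
    """Build a toy knowledge graph mapping (subject, relation) -> object."""
    return {(subj, "founded_by"): obj for subj, obj in PATTERN.findall(text)}


def answer(graph: dict, subject: str) -> str:
    """Answer 'who founded <subject>?' using only the graph."""
    return graph.get((subject, "founded_by"), "unknown")


clean_text = "Acme was founded by Alice. Globex was founded by Bob."

# Manipulation-only attack: one existing word is changed, nothing new is injected.
poisoned_text = clean_text.replace("Alice", "Mallory")

clean_answer = answer(build_graph(clean_text), "Acme")
poisoned_answer = answer(build_graph(poisoned_text), "Acme")

# Count how many whitespace-separated words differ between the two texts.
changed_words = sum(
    a != b for a, b in zip(clean_text.split(), poisoned_text.split())
)

print(clean_answer, poisoned_answer, changed_words)
```

The model weights never change in this sketch; only the graph built from the text does, which is the distinction from pre-training poisoning drawn above.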

Knowledge priming reveals the mechanism. The "How new data permeates LLM knowledge" paper demonstrates why minimal poisoning works: when an LLM learns a new fact through gradient updates, the fact's keywords "prime", meaning they get recruited into unrelated contexts. Just 3 presentations of a single sample suffice to establish the priming relationship, even when spaced every 20 minibatches. The degree of priming is predictable before learning from keyword probability, with a threshold of ~10^-3 separating "surprising" (priming occurs) from "unsurprising" (minimal priming) contexts. This holds across architectures (PaLM-2, Gemma, Llama). Two mitigation techniques reduce priming by 50-95% while preserving learning: stepping-stone text augmentation and ignore-k update pruning. The 3-exposure finding explains why the 0.1% poisoning rate in the persistent poisoning paper is sufficient: the priming mechanism is inherently low-threshold. See Can we predict keyword priming before learning happens?.
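
The threshold rule can be written as a one-line check. This is a minimal sketch, assuming that "surprising" means the keyword's probability before learning falls below roughly 10^-3; the function name and the example probabilities are hypothetical and do not come from the paper.

```python
PRIMING_THRESHOLD = 1e-3  # approximate threshold reported in the note (~10^-3)


def is_surprising(keyword_prob_before: float,
                  threshold: float = PRIMING_THRESHOLD) -> bool:
    """Return True if priming is expected after learning the sample.

    Assumption: a keyword counts as "surprising" when the model assigned it a
    probability below the threshold before any gradient update on the sample.
    """
    if not 0.0 <= keyword_prob_before <= 1.0:
        raise ValueError("keyword_prob_before must be a probability in [0, 1]")
    return keyword_prob_before < threshold


# Hypothetical pre-learning keyword probabilities for two training samples.
rare_keyword = is_surprising(1e-5)    # far below the threshold
common_keyword = is_surprising(0.2)   # far above the threshold

print(rare_keyword, common_keyword)
```

A check like this would run before training, which is the point of the finding: the priming risk of a sample can be estimated from the model's existing predictions alone.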

Original note title

pre-training poisoning at 0.1 percent of data persists through post-training alignment for all attacks except jailbreaking