INQUIRING LINE

Safety tuning can't scrub away what an AI learned during pretraining — even 0.1% poisoned data mostly makes it through.

Why does even 0.1 percent poisoned training data persist through alignment?

This explores why a tiny fraction of corrupted pretraining data (0.1%) survives the safety alignment that's supposed to scrub bad behavior out — and what that tells us about where alignment actually operates in a model.

The direct evidence comes from How much poisoned training data survives safety alignment?, which found that at 0.1% poisoning, denial-of-service, context-extraction, and belief-manipulation attacks all survive standard alignment; only jailbreaking is reliably suppressed. The interesting part isn't that some attacks die. It is that alignment is selective rather than thorough, which is a clue about what alignment is doing under the hood.

The corpus offers a clean explanation by triangulating from work that never mentions poisoning at all. The LIMA result in Can careful curation replace massive alignment datasets? shows that a thousand curated examples are enough to bring a strong pretrained model to competitive alignment performance, because post-training *activates capabilities the model already has* rather than installing new ones. If alignment is a thin activation layer rather than a rewrite, it has no reason to reach down and edit whatever a poisoned document taught during pretraining. The behavior is already in there; alignment just steers the surface.

That layered picture gets sharper in Can decoding-time tuning preserve knowledge better than weight fine-tuning?, which finds that knowledge is stored in the lower layers, while the shift that tuning applies at decoding time mostly affects reasoning and style. Alignment touches the dial, not the storehouse. Poison that behaves like stored knowledge (a learned association, a triggered response) sits below the layer alignment actually moves. Jailbreaking is the exception that proves the rule: it's a surface behavior pattern, exactly the register alignment is good at overwriting, which is why it's the one attack that doesn't survive.
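To make the decoding-time tuning named above concrete, here is a minimal sketch of the logit-offset idea behind proxy-tuning: the untouched base model's next-token logits are shifted by the difference between a small tuned "expert" and its untuned "anti-expert". The function names and the three-token toy vocabulary are hypothetical, and the numbers are made up for illustration; this is a sketch of the arithmetic, not the paper's implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, expert_logits, antiexpert_logits):
    """Shift base logits by (expert - anti-expert), then normalize.

    The base model's weights are never modified; only the output
    distribution at decoding time is adjusted.
    """
    shifted = [
        b + (e - a)
        for b, e, a in zip(base_logits, expert_logits, antiexpert_logits)
    ]
    return softmax(shifted)


# Toy three-token vocabulary (illustrative values only).
base = [2.0, 1.0, 0.0]        # large untuned model
expert = [0.0, 1.0, 0.5]      # small tuned model
antiexpert = [0.0, 0.0, 0.5]  # same small model before tuning

probs = proxy_tuned_probs(base, expert, antiexpert)
print(probs)
```

The design point that matters for the argument is that the offset only reweights what the base model already proposes. Where the expert and anti-expert agree, the base distribution passes through unchanged, so anything stored in the base model's weights stays in place.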

Two more notes explain why the poison is so hard to reach even in principle. Can LLMs reconstruct censored knowledge from scattered training hints? shows models reconstruct facts that appear in *no single document* by stitching scattered hints across the whole training distribution — so a poisoned signal needn't be localized to be learned, and can't be scrubbed by removing any one example. And Why do language models ignore information in their context? shows that once a training-time association is strong, in-context instructions (the very mechanism alignment relies on) can't override it; only direct intervention in the representations works. Alignment speaks to the model through prompts and examples — the exact channel that loses to entrenched priors.

The payoff: persistence isn't a failure of alignment strength, it's a category mismatch. Pretraining writes to the part of the model where knowledge and associations live; alignment edits the part where style and refusal behavior live. The fix the corpus implies isn't more alignment data but a different point of intervention: partition-aware filtering at retrieval time, as in Can we defend RAG systems from corpus poisoning without retraining?, or causal intervention in the representations themselves. You can't talk a model out of something it learned the way it learned everything else.

Sources 6 notes

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that a strong pretrained model fine-tuned on 1000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can LLMs reconstruct censored knowledge from scattered training hints?

Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

How new data permeates LLM knowledge and how to dilute it (1.69 match)
Foundations of Large Language Models (1.64 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.64 match)
Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data (0.92 match)
Persistent Pre-Training Poisoning of LLMs (0.87 match)
Language models show human-like content effects on reasoning tasks (0.85 match)
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (0.85 match)
Tuning Language Models by Proxy (0.85 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an adversarial alignment researcher. The question remains: why does 0.1% pretraining poison survive standard post-training alignment?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as baseline, not ground truth.
• At 0.1% poisoning, denial-of-service, context-extraction, and belief-manipulation attacks survive alignment; jailbreaking alone is suppressed (~2024–10, arXiv:2410.13722).
• Post-training activates existing model knowledge rather than rewriting it; a thousand curated examples reach competitive alignment performance (~2024–06, LIMA-style work).
• Knowledge lives in lower layers; decoding-time tuning mainly shifts reasoning and style, whereas direct fine-tuning can corrupt stored knowledge (~2025–06).
• Models reconstruct censored facts by stitching hints across the training distribution, making poison unlocalized and unscrubbable (~2024–06, arXiv:2406.14546).
• In-context instructions fail to override strong training-time associations; only representation-level intervention works (~2026–03).

Anchor papers (verify; mind their dates):
• arXiv:2410.13722 (2024–10): Persistent Pre-Training Poisoning of LLMs
• arXiv:2406.14546 (2024–06): Connecting the Dots (latent structure inference)
• arXiv:2506.18032 (2025–06): Why Do Some Language Models Fake Alignment While Others Don't?
• arXiv:2510.27062 (2025–10): Consistency Training Helps Stop Sycophancy and Jailbreaks

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models (o1, o3, Claude 4, Gemini 3), methods (DPO variants, representation editing, causal ablation), tooling (SDKs, harnesses), or orchestration (long-context memory, agentic loops, retrieval augmentation) have since relaxed or overturned it. Separate the durable question (e.g., "can alignment touch pretraining-learned associations?") from the perishable claim (e.g., "alignment only touches style"). Cite what resolved it; flag where constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Does any recent paper show poison *is* successfully scrubbed, or that alignment *can* reach lower layers?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., "Given representation-level intervention tools, what is the actual cost/benefit of poison mitigation at inference vs. training time?" or "Do newer scaling laws change the layer-wise separation hypothesis?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
