INQUIRING LINE


What makes early AI training so sticky that safety fine-tuning can barely dislodge a tiny dose of poisoned data?

Does keyword priming explain why pre-training poisoning persists through alignment?

This explores whether one mechanism — keyword priming, where a word's pre-existing probability predicts how easily training can activate it — is the reason poisoned data planted during pre-training survives later safety alignment.

The corpus suggests that keyword priming and poisoning persistence are cousins rather than the same thing. Both are rooted in a deeper fact: what gets laid down during pre-training is sticky, and later training stages mostly nudge rather than rewrite it. The priming work ("Can we predict keyword priming before learning happens?") found that whether a few training exposures can 'switch on' a keyword is predictable from that word's probability before learning ever happens, with a sharp threshold (~10^-3) and as few as three exposures needed. The poisoning work ("How much poisoned training data survives safety alignment?") separately found that denial-of-service, context extraction, and belief manipulation attacks injected at a poisoning rate of just 0.1% persist through standard alignment, while jailbreaking attacks get scrubbed out. So the honest answer is that priming offers a plausible explanation for the *establishment* of the buried behavior, but it doesn't by itself explain the *selective survival*: why some attacks persist and others don't.
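The two numbers quoted above can be made concrete with a small sketch. The function names are hypothetical and not taken from either paper. The note also does not say on which side of the ~10^-3 threshold priming occurs; the sketch assumes it is the low-probability side (keywords that were surprising before learning), which should be checked against the priming paper.

```python
PRIMING_THRESHOLD = 1e-3  # approximate threshold quoted in the note
POISON_RATE = 0.001       # 0.1% poisoning rate quoted in the note


def priming_expected(pre_learning_prob: float,
                     threshold: float = PRIMING_THRESHOLD) -> bool:
    """Guess whether a few exposures would prime a keyword.

    ASSUMPTION: priming is expected when the keyword's probability
    before learning is below the threshold. The direction is not
    stated in the note.
    """
    if not 0.0 <= pre_learning_prob <= 1.0:
        raise ValueError("pre_learning_prob must be in [0, 1]")
    return pre_learning_prob < threshold


def poisoned_token_budget(total_tokens: int,
                          poison_rate: float = POISON_RATE) -> int:
    """Number of tokens an attacker controls at a given poisoning rate."""
    if total_tokens < 0 or not 0.0 <= poison_rate <= 1.0:
        raise ValueError("invalid corpus size or poisoning rate")
    return round(total_tokens * poison_rate)


if __name__ == "__main__":
    for prob in (1e-5, 1e-3, 5e-2):
        print(prob, priming_expected(prob))
    print(poisoned_token_budget(1_000_000))
```

Under these assumptions, a rate of 0.1% corresponds to one poisoned token per thousand, so the absolute attacker budget grows linearly with corpus size even though the rate stays small.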

What ties them together is a recurring theme across the collection: pre-training is where things are decided, and post-training only modulates. The cleanest statement of this is the finding that cognitive biases are planted during pre-training and merely swayed by instruction tuning ("Where do cognitive biases in language models come from?"): models sharing a pre-trained backbone show similar bias patterns regardless of the fine-tuning data poured on top. Read alongside the priming result, this gives a coherent picture: a strong pre-training prior is hard to dislodge, whether that prior is a benign bias or a deliberately poisoned association.

The corpus also suggests *why* alignment struggles to override these priors. When a model has a strong learned association, in-context information and prompting can't beat it; only causal intervention in the representations works ("Why do language models ignore information in their context?"). Prompting more generally can only reorganize what's already in the training distribution, never inject something new ("Can prompt optimization teach models knowledge they lack?"). Alignment via SFT/RLHF is a heavier hammer than prompting, but it plausibly operates in the same regime: it reshapes style and surface behavior more than it cleanly rewrites what's stored in the lower layers. A related result comes from decoding-time methods: proxy-tuning leaves base weights untouched and preserves pre-trained knowledge better than direct fine-tuning, whose weight updates can corrupt knowledge stored in the lower layers ("Can decoding-time tuning preserve knowledge better than weight fine-tuning?").
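A decoding-time "distributional shift" of the kind described for proxy-tuning can be sketched as logit arithmetic. The note does not give the formula. The sketch assumes the commonly described form, in which the large base model's logits are offset by the difference between a small tuned model and its untuned counterpart. All names and numbers here are illustrative, not the paper's API or data.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Next-token distribution with a decoding-time shift.

    ASSUMPTION: shifted = base + (tuned_small - untuned_small).
    The base model's weights are never updated; only its output
    logits are offset at each decoding step.
    """
    if not (len(base_logits) == len(tuned_small_logits)
            == len(untuned_small_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits,
                           untuned_small_logits)
    ]
    return softmax(shifted)


if __name__ == "__main__":
    base = [2.0, 1.0, 0.0]     # toy 3-token vocabulary
    tuned = [0.0, 3.0, 0.0]    # small tuned model prefers token 1
    untuned = [0.0, 0.0, 0.0]  # small untuned model is indifferent
    print(softmax(base))
    print(proxy_tuned_probs(base, tuned, untuned))
```

When the small tuned and untuned models agree, the shift is zero and the base distribution is returned unchanged, which is one way to see why this style of intervention leaves the base model's stored knowledge alone.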

The unexpected lesson is that the selectivity is the interesting part. One reading is that jailbreaking gets suppressed because alignment directly trains against refusal-bypassing on the surface, where the behavior lives, while belief manipulation and context extraction persist because they live deeper in the model's associative wiring, where alignment's gradient pressure barely reaches: the same depth at which biases get planted and at which keyword priming sets its threshold. Keyword priming is best read not as *the* explanation for poisoning persistence, but as one well-measured instance of the broader pattern the corpus keeps surfacing: behaviors written into pre-training representations are cheap to install and expensive to remove, and everything downstream is negotiating with that prior.

Sources 6 notes

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.


Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Language models show human-like content effects on reasoning tasks (2.49 match)
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts (2.48 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (2.43 match)
How new data permeates LLM knowledge and how to dilute it (1.76 match)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (1.71 match)
Learning To Retrieve Prompts for In-Context Learning (1.68 match)
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (1.65 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.64 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: Does keyword priming mechanically explain why pre-training poisoning persists through alignment—or are they separate phenomena sharing a common root?

What a curated library found—and when (dated claims, not current truth):
Findings span 2022–2025. The library identifies these concrete constraints:
• Keyword priming has a sharp activation threshold (~10^-3 base probability); as few as three exposures can flip the switch (2024–2025).
• Pre-training poisoning at 0.1% data persists through standard alignment for denial-of-service, context extraction, and belief manipulation, but jailbreaking attacks get scrubbed (2024-10).
• Models sharing a pre-trained backbone behave alike regardless of fine-tuning data; cognitive biases are "mainly shaped by pretraining, not finetuning" (2025-07).
• In-context information cannot override strong pre-training associations; only causal intervention in representations works (2024–2025).
• Decoding-time methods preserve pre-trained knowledge better than direct weight updates (2024–2025).

Anchor papers (verify; mind their dates):
- arXiv:2410.13722 (Persistent Pre-Training Poisoning, 2024-10)
- arXiv:2507.07186 (Cognitive Biases & Pretraining, 2025-07)
- arXiv:2508.12863 (Word Meanings in Transformers, 2025-08)
- arXiv:2510.27062 (Consistency Training, 2025-10)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the ~10^-3 threshold, the selective survival pattern (why belief manipulation persists but jailbreaks don't), and the "prompt cannot inject, only reactivate" claim: has newer work (last 6 months) on model editing, in-context learning, or structured fine-tuning shown these can be relaxed or overturned? Separate the durable question (how deep is poisoned knowledge encoded?) from the perishable limitation (can alignment reach it?).
(2) Surface work contradicting or superseding the claim that alignment "merely modulates" rather than rewrites. Look for papers on mechanistic unlearning, adversarial fine-tuning, or multi-pass correction that claim *actual* knowledge removal.
(3) Propose two questions that assume the regime may have moved: (a) If decoding-time intervention now rivals weight updates, does that change where poisoning *should* be defended? (b) If consistency training or recent iterative methods can force contradictions, does selective poisoning survival still hold?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
