INQUIRING LINE

Just three poisoned training examples can reliably wire a language model to repeat false claims tied to a keyword.

How does keyword priming enable language models to spread poisoned information?

This explores how a small, predictable training signal — certain keywords becoming reliably activated after just a few exposures — can be exploited to plant information that a model later regurgitates, and why such planted information resists correction.

Keyword priming matters as a poisoning vector because the worry isn't that models memorize huge fabricated corpora, but that a tiny, cheap, *predictable* nudge can wire a keyword to an output. The most direct evidence comes from work showing that priming after training is forecastable before you even train (see "Can we predict keyword priming before learning happens?"). A keyword's pre-learning probability tells you whether it will prime afterward, with a sharp threshold around 10^-3 separating keywords that take from keywords that don't, and as few as three exposures are enough to lock the effect in. For anyone trying to spread poisoned associations, that's a recipe: target keywords whose pre-learning probability falls on the priming side of the threshold, inject a handful of examples, and the model reliably coughs up the planted link. You don't need scale; you need to know which keywords will prime.

What makes the poison *stick* is a second mechanism: once an association is strong, the model's own training memory overrides whatever's actually in front of it. Models routinely ignore their context when parametric knowledge from training points the other way, and plain prompting can't undo it; only intervening in the internal representations does (see "Why do language models ignore information in their context?"). So a poisoned keyword-to-claim link, once established, behaves like a stubborn prior: corrective text in the prompt slides off. The flip side confirms the ceiling: prompting can only reorganize what's already been trained in, and it can't inject or un-inject the underlying knowledge (see "Can prompt optimization teach models knowledge they lack?"). The damage is done at training time, not prompt time.

The unsettling extension is that the poison need not look like poison. Behavioral traits transmit between models through data that bears *no semantic relationship* to the trait: filtered, scrubbed data still carries a statistical signature that survives (see "Can language models transmit hidden behavioral traits through unrelated data?"). Read alongside the priming-threshold result, this suggests a poisoning surface that content filters can't see: the signal lives in statistical co-occurrence, not in readable meaning, so a human reviewer scanning for bad claims finds nothing.

The corpus also shows where the defenses actually live: at retrieval, not in the weights. For retrieval-augmented systems, lightweight methods bound a poisoned document's influence and flag it by its abnormal similarity behavior, all without retraining (see "Can we defend RAG systems from corpus poisoning without retraining?"). That's a useful contrast: it tells you poisoning is most tractable to stop *before* the model internalizes it. Once a keyword crosses the priming threshold inside the parameters, you're in the harder regime described above, where context can't override and prompting can't reach.
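To make "abnormal similarity behavior" concrete, here is a toy masking-sensitivity check. It is only loosely inspired by the description of flagging documents whose similarity collapses under token masking, and it is not the RAGMask implementation. A bag-of-words cosine similarity stands in for a real retriever, and the function names and the `collapse_ratio` cutoff are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def max_masking_drop(query: str, document: str) -> float:
    """Largest relative similarity drop caused by masking one distinct token.

    Assumption: whitespace tokenization and bag-of-words cosine are stand-ins
    for a real retriever's scoring function.
    """
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    base = cosine(q, d)
    if base == 0.0:
        return 0.0
    worst = 0.0
    for token in d:
        # Remove every occurrence of this token and re-score the document.
        masked = Counter({t: c for t, c in d.items() if t != token})
        drop = (base - cosine(q, masked)) / base
        worst = max(worst, drop)
    return worst


def is_suspicious(query: str, document: str, collapse_ratio: float = 0.8) -> bool:
    """Flag a document whose match depends almost entirely on one token.

    collapse_ratio is a hypothetical cutoff, not a published value.
    """
    return max_masking_drop(query, document) >= collapse_ratio


query = "zebra facts habitat"
benign = "zebra facts habitat stripes"
stuffed = "zebra zebra zebra zebra buy pills now"
print(is_suspicious(query, benign), is_suspicious(query, stuffed))
```

The intuition is that a document matching a query through broad overlap degrades gracefully when one token is hidden, while a document engineered around a single trigger loses nearly all of its similarity at once.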

Worth knowing, even though you didn't ask: the priming threshold means poisoning is *measurable in advance*. The same finding that explains the vulnerability (pre-learning probability predicts post-learning priming) is also a screening tool. You can, in principle, audit a training set by asking which keyword-claim pairs sit near the 10^-3 line and would lock in after three exposures, turning an attack surface into something you can scan for.
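A minimal sketch of that screening idea follows. It flags keyword pairs whose pre-learning probability sits within a band around the ~10^-3 threshold and that appear at least three times in a candidate training set. Everything beyond those two numbers is an assumption for illustration: the `KeywordCandidate` record, the `band` width in log10 units, and the premise that the caller supplies the probabilities (in practice they would come from scoring the model before training). It is not the procedure from the underlying paper.

```python
import math
from dataclasses import dataclass

PRIMING_THRESHOLD = 1e-3  # approximate threshold reported in the source note
MIN_EXPOSURES = 3         # number of exposures the note says can suffice


@dataclass(frozen=True)
class KeywordCandidate:
    """Hypothetical record for one context/keyword pair in a training set."""
    context: str
    keyword: str
    pre_learning_prob: float  # assumed: P(keyword | context) before training
    exposures: int            # occurrences in the candidate training set


def log10_distance(prob: float, threshold: float = PRIMING_THRESHOLD) -> float:
    """Distance from the threshold in orders of magnitude."""
    return abs(math.log10(prob) - math.log10(threshold))


def flag_for_review(candidates, band: float = 0.5,
                    min_exposures: int = MIN_EXPOSURES):
    """Return pairs near the threshold with enough exposures, closest first.

    band is a hypothetical tolerance in log10 units, not a published value.
    """
    flagged = [
        c for c in candidates
        if c.pre_learning_prob > 0
        and c.exposures >= min_exposures
        and log10_distance(c.pre_learning_prob) <= band
    ]
    return sorted(flagged, key=lambda c: log10_distance(c.pre_learning_prob))


candidates = [
    KeywordCandidate("ctx A", "alpha", 1e-3, 3),
    KeywordCandidate("ctx B", "beta", 2e-3, 5),
    KeywordCandidate("ctx C", "gamma", 0.5, 10),  # far from the threshold
    KeywordCandidate("ctx D", "delta", 1e-3, 1),  # too few exposures
]
for c in flag_for_review(candidates):
    print(c.keyword, c.pre_learning_prob, c.exposures)
```

Working in log10 units keeps the band symmetric around the threshold on a multiplicative scale, which seems the natural choice when the quantity of interest spans several orders of magnitude.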

Sources 5 notes

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a security researcher auditing keyword-priming poisoning vectors in current LLMs. The question: can adversaries reliably plant false associations into model weights via minimal, predictable input—and if so, what defenses actually work?

What a curated library found — and when (findings span 2022–2026; treat as dated claims):
• A keyword's pre-learning probability (~10^-3 threshold) predicts whether it will prime after training; as few as three exposures lock the effect in (~2025).
• Once a poisoned keyword–claim link is trained in, models ignore corrective context; only internal-representation intervention overrides it; prompting cannot un-inject knowledge (~2025).
• Behavioral traits (including misleading associations) transmit between models through statistically co-occurring data bearing *no semantic relationship* to the trait—invisible to content filters (~2025).
• RAG systems can bound poisoned-document influence via lightweight retrieval-layer defenses without retraining (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.14805 (2025-07) — Subliminal Learning: behavioral traits via hidden signals
• arXiv:2504.09522 (2025-04) — How new data permeates LLM knowledge and dilution strategies
• arXiv:2505.16014 (2025-05) — RAG defenses against corpus poisoning
• arXiv:2506.08952 (2025-06) — Grounding under adversarial/loaded queries

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, determine whether newer model scales, fine-tuning methods, interpretability tools, or adversarial-training schemes have since RELAXED the 10^-3 threshold, enabled context to override trained priors, or made the statistical signature *visible* to automated filters. Where constraints still hold, cite the paper that confirms it; name what would need to break them.
(2) Surface the strongest *contradicting* work: papers arguing priming is either harder to plant than the library suggests, or easier to defend against, published in the last ~6 months.
(3) Propose 2 research questions that assume the threat model may have evolved—e.g., do multimodal or cross-lingual models show the same threshold behavior? Can chain-of-thought or reasoning scaffolds bypass the parametric-override problem?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
