SYNTHESIS NOTE

Can we predict keyword priming before learning happens?

Explores whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from properties measurable before training begins, and what mechanisms make that prediction possible.

Synthesis note · 2026-02-23 · sourced from MechInterp

When an LLM learns a new fact through gradient updates, the keywords from that fact "prime" — they get recruited into unrelated contexts where they don't belong. Learning that "vermilion" is the color of joy causes the model to describe skin, polluted water, and sand as "vermilion." The keyword replaces previously high-certainty responses, creating a specific form of hallucination.

The central finding: priming is predictable before learning. Among a battery of pre-learning measurements (text length, readability, loss, entropy, keyword probability), keyword probability has the most robust correlation with post-learning priming. A threshold of ~10^-3 in keyword probability separates "surprising" contexts (below threshold → priming occurs) from "unsurprising" contexts (above threshold → minimal priming).
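The threshold check described above can be sketched in a few lines. This is a minimal illustration, assuming the keyword's log-probabilities have already been read off the model before any gradient update. The function names, the example probabilities, and the treatment of a multi-token keyword as a product of token probabilities are assumptions of this sketch, not details given in the note.

```python
import math

# Approximate threshold from the note (~10^-3); the exact cut-off is an assumption.
SURPRISE_THRESHOLD = 1e-3


def keyword_probability(token_logprobs):
    """Probability of a (possibly multi-token) keyword in its context.

    token_logprobs: log-probabilities of each keyword token, measured
    BEFORE learning. Multiplying token probabilities is an assumption
    of this sketch.
    """
    return math.exp(sum(token_logprobs))


def is_surprising(prob, threshold=SURPRISE_THRESHOLD):
    """True when the keyword falls below the threshold, i.e. priming is expected."""
    return prob < threshold


# Hypothetical measurements, not data from the source.
contexts = {
    "surprising keyword": [math.log(2e-5)],
    "unsurprising keyword": [math.log(0.05), math.log(0.9)],
}
flags = {name: is_surprising(keyword_probability(lp)) for name, lp in contexts.items()}
for name, flag in flags.items():
    print(name, "-> priming expected" if flag else "-> minimal priming expected")
```

In practice the log-probabilities would come from a forward pass of the model being trained; the sketch only shows the comparison step.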

This holds across:

Different keyword sets
Model sizes (PALM-2-XS, S)
Architectures (PALM-2, Gemma, Llama) despite different backbones, training procedures, and data mixtures
Training stages

The dynamics of contamination are concerning:

Just three presentations of a single sample, even when spaced 20 minibatches apart, are sufficient to establish the priming relationship
Two independent facts from different themes create independent priming effects without interference
Priming is thematically bounded but not eliminated — cross-theme priming is attenuated but still present
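The spaced-presentation setup from the first point can be pictured as a simple schedule. This is a rough sketch under one reading of "spaced every 20 minibatches" (one presentation at steps 0, 20, and 40). The helper names and the 60-step horizon are hypothetical and are not taken from the source.

```python
def presentation_steps(n_presentations=3, spacing=20, first_step=0):
    """Minibatch indices at which the single new sample is shown."""
    return [first_step + i * spacing for i in range(n_presentations)]


def build_schedule(total_steps, inject_at):
    """Label each minibatch as 'new_sample' or 'background'."""
    inject = set(inject_at)
    return ["new_sample" if step in inject else "background" for step in range(total_steps)]


steps = presentation_steps()
schedule = build_schedule(total_steps=60, inject_at=steps)
print(steps)
print(schedule.count("new_sample"), "of", len(schedule), "minibatches contain the new sample")
```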

Two mitigation techniques reduce priming by 50-95% while preserving learning:

Stepping-stone text augmentation — modifying the training text to reduce keyword surprise
Ignore-k update pruning — pruning the most affected parameter updates
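The second technique can be sketched as follows. This is a toy illustration only: it assumes "most affected" means the updates with the largest absolute magnitude and that `k` is a percentage of all parameters. Neither detail is stated in this note, and `ignore_top_k` is a hypothetical helper that works on a flat list of update values, not on real model tensors.

```python
def ignore_top_k(updates, k_percent):
    """Zero out the k_percent largest-magnitude entries of a flat update vector.

    Assumptions (not from the note): "most affected" = largest |update|,
    and k is expressed as a percentage of all parameters.
    """
    if not 0 <= k_percent <= 100:
        raise ValueError("k_percent must be between 0 and 100")
    n_drop = int(len(updates) * k_percent / 100)
    if n_drop == 0:
        return list(updates)
    # Indices sorted by descending magnitude; the first n_drop are pruned.
    by_magnitude = sorted(range(len(updates)), key=lambda i: abs(updates[i]), reverse=True)
    dropped = set(by_magnitude[:n_drop])
    return [0.0 if i in dropped else u for i, u in enumerate(updates)]


# Hypothetical update vector: two large entries among small ones.
raw_update = [0.01, -0.90, 0.02, 0.75, -0.03, 0.01, 0.02, -0.01, 0.03, 0.02]
pruned = ignore_top_k(raw_update, k_percent=20)
print(pruned)
```

The remaining small updates are applied unchanged. Whether that is enough to preserve learning of the new fact is an empirical question the sketch does not address.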

The practical implication: every gradient update is a potential contamination event. The degree of contamination is predictable before the update is applied, which makes preventive measures possible. This connects to "How much poisoned training data survives safety alignment?": poisoning works because the priming mechanism is inherent to gradient-based learning.

Inquiring lines that read this note 54

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How do training priors constrain what context information can override?

Can prompting inject entirely new knowledge into language models?

What role does compression play in language model capability and generalization?

Can context compression preserve what matters without introducing bias?

Do language models learn genuine linguistic structure or just surface patterns?

Does generalization frequency explain why models favor upward semantic movement?

Do language models understand semantics or rely on pattern matching?

Can frame semantics explain why context matters more than word similarity?

Does alignment training create blind spots in detecting genuine safety threats?

Does keyword priming explain why pre-training poisoning persists through alignment?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Why does fine-tuning fail to remove temporal contamination from pretraining?

What determines success in training models on multiple tasks?

Can backward transfer measurements reliably predict optimal multi-task training order?

How do transformer attention mechanisms implement memory and algorithmic functions?

Are retrieval heads the mechanistic explanation for needle-in-haystack performance failures?

Do language model representations contain causally steerable task-specific features?

Do all semantic steering effects follow predictable patterns based on feature alignment?

How does memorization interact with learning and generalization?

Is model self-awareness based on genuine introspection or pattern matching?

How do adversarial and manipulative prompts attack reasoning models?

Can membership inference attacks reliably detect training data exposure?

Does RLHF training sacrifice accuracy and grounding for user agreement?

How does dialogue during training shape the ability to ignore word frequency?

How do evaluation biases undermine LLM quality assessment systems?

Why does probability of text completion not equal knowledge value?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

What makes AI persuasion effective and how can we counter it?

How does post-training persuasion ability interact with exposure-based decay over time?

How do language models inherit human biases from training data?

Can implicit association tests reveal LLM biases beneath trained responses?

What memory architectures best support persistent reasoning across extended interactions?

How does co-activation shape which memories become linked together?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

What role does query-level exposure play in enabling compositional generalization?

How should iterative research systems allocate reasoning per search step?

Does the pretrained prior actually constrain what internalized search can discover?

Why do semantic similarity and task relevance diverge in vector embeddings?

Why do embeddings measure association instead of actual task relevance?

Why do benchmark improvements fail to reflect actual reasoning quality?

Can contamination-free evaluation distinguish between memorization and genuine prediction ability?

How do self-generated feedback mechanisms enable effective model learning?

What makes content informative and not-yet-mastered for reinforcement during pretraining?

When should retrieval-augmented systems decide to fetch new information?

Does tail distribution collapse in training predict retrieval failure patterns?

What structural biases does transformer attention create in language model outputs?

How does transformer attention structurally bias models toward prominent and repeated content?

Related concepts in this collection 5

How much poisoned training data survives safety alignment? Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment.
priming is the mechanism; poisoning exploits it; the 3-exposure finding explains why minimal poisoning data suffices
Why do language models ignore information in their context? Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.
priming creates new associations that can subsequently override context; the two mechanisms compound
Does training on AI-generated content permanently degrade model quality? When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
priming and collapse are both consequences of how gradient updates reshape the model's internal distribution
When do language models stop memorizing and start generalizing? Can we measure the exact capacity limit where models transition from memorizing training data to learning underlying patterns? Understanding this boundary could reshape how we think about model learning and privacy.
priming is a specific manifestation of how memorization consumes model capacity; the 3-exposure sufficiency finding maps to the low threshold at which capacity fills
Can we prune training data without hurting model performance? This explores whether difficulty metrics can identify redundant training examples that can be safely removed. It matters because most datasets contain massive waste — if we can find which examples are truly necessary, we could train better models on far less data.
complementary perspectives on training data efficiency: pruning shows most data is redundant (easy examples removable), while priming shows even minimal data (3 exposures) can disproportionately affect generative behavior; the keyword probability threshold (~10^-3) functions as an implicit difficulty metric
