SYNTHESIS NOTE

Can we predict keyword priming before learning happens?

Exploring whether the degree to which newly learned keywords contaminate unrelated contexts can be predicted from measurable properties before training begins, and what mechanisms enable this prediction.

Synthesis note · 2026-02-23 · sourced from MechInterp

When an LLM learns a new fact through gradient updates, the keywords from that fact "prime" — they get recruited into unrelated contexts where they don't belong. Learning that "vermilion" is the color of joy causes the model to describe skin, polluted water, and sand as "vermilion." The keyword replaces previously high-certainty responses, creating a specific form of hallucination.

The central finding: priming is predictable before learning. Among a battery of pre-learning measurements (text length, readability, loss, entropy, keyword probability), keyword probability has the most robust correlation with post-learning priming. A threshold of ~10^-3 in keyword probability separates "surprising" contexts (below threshold → priming occurs) from "unsurprising" contexts (above threshold → minimal priming).
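The threshold test described above can be sketched in a few lines. This is a minimal illustration, not the source's measurement code: the toy logits, the token index, and the helper names `keyword_probability` and `is_surprising` are all hypothetical, and the only value taken from the note is the approximate 10^-3 threshold.

```python
import math

# Approximate threshold quoted in the note; treat it as a rough boundary.
PRIMING_THRESHOLD = 1e-3


def softmax(logits):
    """Convert raw next-token logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keyword_probability(logits, keyword_id):
    """Pre-learning probability the model assigns to the keyword token."""
    return softmax(logits)[keyword_id]


def is_surprising(prob, threshold=PRIMING_THRESHOLD):
    """Below the threshold, the note predicts priming after learning."""
    return prob < threshold


# Hypothetical logits over a 4-token vocabulary (assumption, not real data).
toy_logits = [8.0, 0.0, 2.0, 1.0]

for token_id in (1, 2):
    p = keyword_probability(toy_logits, token_id)
    label = "surprising" if is_surprising(p) else "unsurprising"
    print(f"token {token_id}: p={p:.6f} -> {label}")
```

In practice the logits would come from the model's prediction at the keyword position in the training text, measured before any gradient update is applied.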

This holds across:

The dynamics of contamination are concerning: as the original note title puts it, just 3 exposures suffice.

Two mitigation techniques reduce priming by 50-95% while preserving learning:

  1. Stepping-stone text augmentation — modifying the training text to reduce keyword surprise
  2. Ignore-k update pruning — pruning the most affected parameter updates
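The second technique can be illustrated with a small sketch. The note only says that the most affected parameter updates are pruned, so reading "most affected" as "largest in magnitude" is an assumption here, as are the function name `ignore_top_k`, the percentage parameter, and the toy update vector.

```python
def ignore_top_k(update, k_percent):
    """Zero the k_percent largest-magnitude entries of a flat update vector.

    Assumption: "most affected" is read as "largest absolute value".
    """
    if not 0 <= k_percent <= 100:
        raise ValueError("k_percent must be between 0 and 100")
    n_drop = int(len(update) * k_percent / 100)
    if n_drop == 0:
        return list(update)
    # Rank positions by absolute update size, largest first.
    ranked = sorted(range(len(update)), key=lambda i: abs(update[i]), reverse=True)
    dropped = set(ranked[:n_drop])
    return [0.0 if i in dropped else value for i, value in enumerate(update)]


# Hypothetical flattened parameter update (assumption, not real data).
toy_update = [0.5, -3.0, 0.1, 2.0, -0.2]
pruned = ignore_top_k(toy_update, k_percent=40)
print(pruned)  # the two largest-magnitude entries are zeroed
```

The pruned update would then be applied in place of the full one. The right value of `k_percent` is not given in the note and would need to be tuned.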

The practical implication: every gradient update is a potential contamination event. The degree of contamination is predictable before the update is applied, which makes preventive measures possible. This connects to the note "How much poisoned training data survives safety alignment?": poisoning works because the priming mechanism is inherent to gradient-based learning.

Original note title

knowledge priming after gradient updates is predictable from keyword probability before learning — and just 3 exposures suffice