What mechanism makes keyword probability the strongest predictor of priming?
This explores *why* a keyword's probability before any training predicts how strongly that keyword gets primed after gradient updates — i.e., what's actually happening under the hood that makes the pre-existing odds the key lever.
The short version the corpus points to: learning mostly *amplifies associations the model already had* rather than installing new ones. The finding itself shows a sharp threshold at a pre-learning probability of about 10⁻³, separating contexts where priming happens from those where it doesn't, and just three training exposures suffice to trip it (see: Can we predict keyword priming before learning happens?). That threshold behavior is the tell: gradient updates act like a multiplier on an existing seed of probability mass, so where there's a seed above the line, a few nudges grow it; where there isn't, the same nudges go nowhere. The predictor works because it measures how much raw material is already present to be amplified.
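The multiplier intuition can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that a few gradient updates act like a fixed shift `delta` on the keyword's log-odds; that modeling choice, the function names, and the value of `delta` are assumptions of this example, not something the source notes establish. Under that assumption the odds are multiplied by the same factor everywhere, so the absolute gain in probability scales with the prior.

```python
import math

def shift_logit(p, delta):
    """Return the probability obtained by adding `delta` to the log-odds of `p`."""
    logit = math.log(p / (1.0 - p))
    return 1.0 / (1.0 + math.exp(-(logit + delta)))

def absolute_gain(p, delta=1.0):
    """Absolute increase in probability caused by the same log-odds shift."""
    return shift_logit(p, delta) - p

# Same nudge (delta = 1.0, a hypothetical value), three different priors.
for prior in (1e-5, 1e-3, 1e-1):
    print(f"prior={prior:.0e}  gain={absolute_gain(prior):.2e}")
```

The identical nudge yields a far larger absolute gain when the prior is already sizeable, which is the sense in which the pre-learning probability is the raw material being amplified. This toy does not reproduce the sharp threshold itself; it only shows why a larger seed grows more.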
That picture lines up with a broader pattern in the collection: what a model 'knows' after training is largely set during pretraining and only swayed afterward. Cognitive biases turn out to be planted in the pretrained backbone and merely modulated, not created, by later finetuning (see: Where do cognitive biases in language models come from?). Keyword priming looks like the same story at finer grain: the pre-learning probability *is* the pretrained prior, and gradient updates modulate rather than originate it. The reason the prior dominates is the same reason it does elsewhere: strong parametric associations override fresh input. Models fail to integrate new context precisely when prior training associations are strong enough to overrule it, and textual prompting alone can't dislodge them (see: Why do language models ignore information in their context?).
The sharpest mechanistic echo comes from work that decomposes chain-of-thought performance and finds that *output probability alone* swings accuracy from 26% to 70%, operating as a factor independent of genuine reasoning (see: What three separate factors drive chain-of-thought performance?). In other words, baseline probability keeps turning up as a dominant hidden variable behind LLM behavior, and priming is one more place where it is the load-bearing predictor. There's a plausible structural reason these priors are so entrenched, too: frequent words tend to be the more general ones (hypernyms occur more often than hyponyms), so high-prior keywords are likely the ones with the densest web of associations ready to be reinforced (see: Does word frequency correlate with semantic abstraction?).
The thing worth taking away: 'predictable from keyword probability' isn't a curiosity about correlation. It's evidence that fine-tuning is closer to *re-weighting* what's already latent than to teaching genuinely new content. That reframes a lot of practical questions. If you want a model to absorb a fact, its prior probability for the relevant keywords may matter more than how many times you show it, and a keyword sitting below the threshold may resist priming no matter how you train. It also suggests why surgical interventions can match or beat brute force: a related thread finds that only ~20% of tokens are high-entropy 'forking' tokens, and that this minority carries the learning signal in RL training (see: Do high-entropy tokens drive reasoning model improvements?). Across these notes, the recurring lesson is that a small, measurable property of the model's *existing* distribution, not the volume of new data, is what decides what changes.
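The practical check described above, whether a keyword's prior clears the threshold before any training, can be sketched as follows. This is a minimal illustration, not the procedure used in the source notes: the helper names, the toy logits, and the four-token vocabulary are hypothetical, and the `1e-3` constant is the approximate threshold the notes report.

```python
import math

PRIMING_THRESHOLD = 1e-3  # approximate threshold reported in the notes

def softmax(logits):
    """Convert raw logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def likely_to_prime(logits, keyword_index, threshold=PRIMING_THRESHOLD):
    """Return (prior probability, whether it clears the threshold)."""
    prob = softmax(logits)[keyword_index]
    return prob, prob >= threshold

# Hypothetical next-token logits over a toy 4-token vocabulary.
toy_logits = [5.0, 2.0, 0.0, -6.0]

for idx in range(len(toy_logits)):
    prob, above = likely_to_prime(toy_logits, idx)
    print(f"token {idx}: prior={prob:.2e}  above threshold={above}")
```

With a real model, `toy_logits` would be replaced by the pre-training-step logits for the context of interest; how those are obtained depends on the framework and is left out here.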
Sources (6 notes)
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.
WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.
Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.