Why do primacy effects peak at specific instruction densities?
This explores why early instructions get disproportionate attention — and whether that 'primacy' advantage is strongest at certain numbers of instructions in a prompt rather than scaling smoothly.
Worth flagging up front: the corpus doesn't contain a study that directly measures primacy *peaking* at a specific density. It does hold the two mechanisms whose interaction would produce exactly that shape, and reading them together is more revealing than any single paper on position bias would be.
The first mechanism is architectural. Transformer soft attention is structurally biased toward tokens that are repeated or contextually prominent, regardless of whether they're relevant (see: Does transformer attention architecture inherently favor repeated content?). Early instructions are prominent by default: they anchor the context and get attended to on every subsequent step. A primacy effect therefore isn't a quirk of training; it falls out of how attention weights distribute. That's the 'why early wins' part.
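The two structural claims above can be made concrete with a toy calculation. This is a minimal sketch, not a model of any real transformer: the token names and the equal relevance scores are invented for illustration, and the helper names (`aggregate_attention`, `causal_visibility`) are hypothetical. It shows only that softmax hands more total weight to a token that appears more often, even when every occurrence scores the same, and that under a causal mask earlier positions are visible to more later queries.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_attention(tokens, score_of):
    """Sum the softmax attention mass received by each distinct token."""
    weights = softmax([score_of[t] for t in tokens])
    mass = {}
    for tok, w in zip(tokens, weights):
        mass[tok] = mass.get(tok, 0.0) + w
    return mass

def causal_visibility(n):
    """Number of query positions that can attend to each key position
    under a causal mask (position i is visible to positions i..n-1)."""
    return [n - i for i in range(n)]

# Assumption for illustration: all tokens are equally relevant to the query.
score_of = {"A": 0.0, "B": 0.0, "C": 0.0}
context = ["A", "B", "A", "C", "A"]  # "A" is repeated three times

mass = aggregate_attention(context, score_of)
visibility = causal_visibility(len(context))

print(mass)        # "A" collects three fifths of the mass purely through repetition
print(visibility)  # the first position is visible to every later query
```

Nothing here depends on relevance: the imbalance comes entirely from repetition and position, which is the sense in which the bias is structural.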
The second mechanism is the density curve, and this is where 'specific densities' comes in. Instruction-following doesn't degrade smoothly across all models; it degrades in distinct *patterns*: linear for small models, exponential for mid-range models, and a threshold-decay shape for reasoning models, which hold steady to roughly 150 instructions and then collapse steeply (see: How does instruction density affect model performance?). A threshold curve is precisely the condition under which a primacy effect would appear to 'peak': below the threshold the model has enough capacity to honor instructions roughly by merit, so position matters little; near the breaking point, attention can no longer spread across everything, and the structurally favored early tokens are the ones that survive the squeeze. On this reading, the peak isn't a property of primacy alone. It is the point where rising density meets a fixed attention budget.
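The 'rising density meets a fixed budget' argument can be sketched as a toy model. Everything in it is an assumption made for illustration and none of it comes from IFScale: a fixed `budget` of 150 (chosen only to echo the roughly-150 threshold), a block of `early` instructions that each carry a prominence multiplier `boost`, and a compliance rate that is each instruction's share of the budget, capped at 1. The point is only that a cap plus a positional multiplier is enough to make the early-versus-late gap rise and then fall.

```python
def compliance(n, budget=150.0, early=10, boost=2.0):
    """Toy compliance rates (early_rate, late_rate) for a prompt of n instructions.

    Assumptions (all hypothetical): the first `early` instructions weigh
    `boost` times as much as the rest, and the model spreads a fixed
    `budget` over the total weight. A rate is capped at 1.0.
    """
    if n <= early:
        raise ValueError("n must exceed the number of early instructions")
    total_weight = early * boost + (n - early)
    per_unit = budget / total_weight        # budget available per unit of weight
    early_rate = min(1.0, per_unit * boost)
    late_rate = min(1.0, per_unit)
    return early_rate, late_rate

def primacy_gap(n, **kwargs):
    """Absolute advantage of early instructions over late ones."""
    early_rate, late_rate = compliance(n, **kwargs)
    return early_rate - late_rate

counts = range(20, 601, 10)
gaps = {n: primacy_gap(n) for n in counts}
peak_n = max(gaps, key=gaps.get)

for n in (100, 200, peak_n, 600):
    print(n, round(gaps[n], 3))
```

With these invented defaults the gap is zero while both groups are saturated (up to n = 140), grows once late instructions start to fail, and shrinks again after early instructions lose saturation too. Where the peak falls depends entirely on the made-up parameters, so the sketch illustrates the shape of the argument, not a prediction about any model.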
There's a deeper twist from how instruction-tuning actually works. Models trained on semantically empty or even wrong instructions perform almost as well as those trained on correct ones; what transfers is knowledge of the output *format*, not the task content (see: Does instruction tuning teach task understanding or output format?). If a model is partly pattern-matching to the *shape* of an instruction block rather than reasoning through each item, then position and prominence do more of the work than meaning, and crowding the prompt makes the model lean harder on those positional shortcuts. The bias isn't easily finetuned away either: cognitive biases in LLMs are planted in pretraining and only modulated afterward (see: Where do cognitive biases in language models come from?).
If you want a lever rather than an explanation, the place to look is intervention. System 2 Attention, which regenerates the context to strip irrelevant material before answering, directly interrupts the over-weighting loop (see: Does transformer attention architecture inherently favor repeated content?), and consistency training teaches models to respond identically whether a prompt is clean or padded, using their own clean answers as the target (see: Can models learn to ignore irrelevant prompt changes?). Both are, in effect, attempts to flatten the very position-and-density curve that creates the primacy peak in the first place.
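Both interventions can be sketched in a few lines. This is a hedged outline under stated assumptions, not the published implementations: `llm` is a hypothetical callable that maps a prompt string to a completion, the prompt wording is invented, and `consistency_pairs` covers only the output-level idea (pairing a padded prompt with the model's own answer to the clean prompt), not the activation-level variant or the training loop itself.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: prompt in, completion out. Not a real library API.
LLM = Callable[[str], str]

def s2a_answer(llm: LLM, context: str, question: str) -> str:
    """Two-pass answer in the spirit of System 2 Attention:
    first regenerate the context, then answer from the rewrite only."""
    rewrite_prompt = (
        "Rewrite the text below, keeping only what is relevant to the "
        "question and dropping opinions, repetition, and filler.\n\n"
        f"Text:\n{context}\n\nQuestion:\n{question}\n\nRewritten text:"
    )
    cleaned = llm(rewrite_prompt)
    # The second call never sees the original context, only the rewrite.
    answer_prompt = f"Context:\n{cleaned}\n\nQuestion:\n{question}\n\nAnswer:"
    return llm(answer_prompt)

def consistency_pairs(
    llm: LLM,
    clean_prompts: List[str],
    wrap: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build (wrapped_prompt, target) training pairs where the target is
    the model's own response to the clean prompt."""
    return [(wrap(prompt), llm(prompt)) for prompt in clean_prompts]
```

The design choice worth noticing in the second function is that no human-written label is involved: the target is whatever the model already says when the prompt is clean, so training only asks it to say the same thing when the prompt is padded.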
Sources (5 notes)
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range models), and threshold decay (reasoning models hold up to ~150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions; the reported figures are 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.