INQUIRING LINE

AIs remember early instructions better — but that advantage might peak, not grow, as you pile on more rules.

Why do primacy effects peak at specific instruction densities?

This explores why early instructions get disproportionate attention — and whether that 'primacy' advantage is strongest at certain numbers of instructions in a prompt rather than scaling smoothly.

Worth flagging up front: the corpus doesn't contain a study that directly measures primacy *peaking* at a specific density. It does hold the two mechanisms whose interaction would produce exactly that shape, and reading them together is more revealing than any single paper on position bias would be.

The first mechanism is architectural. Transformer soft attention is structurally biased toward tokens that are repeated or contextually prominent, regardless of whether they're relevant (see "Does transformer attention architecture inherently favor repeated content?"). Early instructions are prominent by default: they anchor the context and get attended to on every subsequent step. So a primacy effect isn't a quirk of training; it falls out of how attention weights distribute. That's the 'why early wins' part.
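A toy calculation can make the repetition half of that claim concrete. This is a minimal sketch, not a model of any real transformer: the function name `attention_mass`, the tokens, and the scores are all invented for illustration. It only shows that under a softmax, a token that occupies several positions can collect more total weight than a single higher-scoring one.

```python
import math


def attention_mass(scored_positions):
    """Softmax over per-position scores, then sum the weight per distinct token.

    `scored_positions` is a list of (token, score) pairs, one per context
    position. Scores are hypothetical relevance logits, not real model output.
    """
    exps = [(token, math.exp(score)) for token, score in scored_positions]
    normalizer = sum(value for _, value in exps)
    mass = {}
    for token, value in exps:
        mass[token] = mass.get(token, 0.0) + value / normalizer
    return mass


# One relevant token with a higher score, one filler token repeated five times.
context = [("relevant", 1.0)] + [("filler", 0.0)] * 5
mass = attention_mass(context)
print(mass)
```

With these made-up numbers the repeated filler token ends up with more total attention mass than the relevant one, even though each individual filler position scores lower.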

The second mechanism is the density curve, and this is where 'specific densities' comes in. Instruction-following doesn't degrade the same way across models; it degrades in distinct *patterns*: linear for small models, exponential for mid-range, and a threshold-decay shape for reasoning models that hold steady to roughly 150 instructions and then collapse steeply (see "How does instruction density affect model performance?"). A threshold curve is precisely the condition under which a primacy effect would appear to 'peak'. Below the threshold the model has enough capacity to honor instructions roughly by merit, so position matters little. Near the breaking point, attention can no longer spread across everything, and the structurally favored early tokens are the ones that survive the squeeze. The peak isn't a property of primacy alone; it's the point where rising density meets a fixed attention budget.
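The shape of that argument can be sketched in a few lines. Everything here is an assumption made for illustration: the function names, the slopes and rates, and the idea of a shared "attention budget" split in proportion to weight are not taken from the IFScale paper or any other source. Only the rough 150-instruction threshold echoes the note above.

```python
import math


def linear_decay(n: int, slope: float = 0.002) -> float:
    """Accuracy drops by a fixed amount per added instruction."""
    return max(0.0, 1.0 - slope * n)


def exponential_decay(n: int, rate: float = 0.005) -> float:
    """Accuracy shrinks by a fixed fraction per added instruction."""
    return math.exp(-rate * n)


def threshold_decay(n: int, threshold: int = 150, steepness: float = 0.02) -> float:
    """Accuracy holds until `threshold`, then collapses steeply."""
    if n <= threshold:
        return 1.0
    return math.exp(-steepness * (n - threshold))


def primacy_advantage(n: int, budget: float = 150.0,
                      early_k: int = 10, early_boost: float = 2.0) -> float:
    """Toy gap in compliance between an early and a late instruction.

    Hypothetical model: the first `early_k` instructions carry weight
    `early_boost`, the rest carry weight 1, and a fixed `budget` is shared
    in proportion to weight. Per-instruction compliance is capped at 1.
    """
    if n < 1:
        raise ValueError("need at least one instruction")
    k = min(n, early_k)
    total_weight = k * early_boost + (n - k)
    early = min(1.0, early_boost * budget / total_weight)
    late = min(1.0, budget / total_weight)
    return early - late


for n in (50, 150, 200, 290, 500, 1000):
    print(n, round(threshold_decay(n), 3), round(primacy_advantage(n), 3))
```

In this toy the early/late gap is zero while the total weight fits inside the budget, climbs once late instructions start losing out, and shrinks again once even the boosted early instructions are squeezed. With the default values the gap is largest at 290 instructions, which is purely an artifact of the chosen parameters.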

There's a deeper twist from how instruction-tuning actually works. Models trained on semantically empty or even wrong instructions perform almost as well as those trained on correct ones; what transfers is knowledge of the output *format*, not the task content (see "Does instruction tuning teach task understanding or output format?"). If a model is partly pattern-matching to the *shape* of an instruction block rather than reasoning through each item, then position and prominence do more of the work than meaning, and crowding the prompt makes the model lean harder on those positional shortcuts. The bias isn't easily finetuned away either: cognitive biases in LLMs are planted in pretraining and only modulated afterward (see "Where do cognitive biases in language models come from?").

If you want a lever rather than an explanation, the interesting doorway is intervention. System 2 Attention, which regenerates the context to strip irrelevant material before answering, directly interrupts the over-weighting loop (see "Does transformer attention architecture inherently favor repeated content?"). Consistency training teaches models to respond identically whether a prompt is clean or padded, using their own clean answers as the target (see "Can models learn to ignore irrelevant prompt changes?"). Both are, in effect, attempts to flatten the very position-and-density curve that creates the primacy peak in the first place.
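The System 2 Attention idea reduces to two model calls: one to rewrite the context, one to answer from the rewrite. The sketch below assumes a hypothetical `generate` callable that maps a prompt string to a completion string. The template wording is invented here and is not the prompt used in the original work.

```python
from typing import Callable

# Illustrative templates; the wording is an assumption, not a published prompt.
REWRITE_TEMPLATE = (
    "Rewrite the text below, keeping only what is needed to answer the "
    "question and dropping opinions or irrelevant material.\n\n"
    "Text:\n{context}\n\nQuestion:\n{question}\n\nRewritten text:"
)
ANSWER_TEMPLATE = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def system2_attention(generate: Callable[[str], str],
                      context: str, question: str) -> str:
    """Two-pass answer: regenerate the context first, then answer from it.

    `generate` is a stand-in for any text-completion call.
    """
    # Pass 1: ask the model to strip material that should not drive the answer.
    cleaned = generate(REWRITE_TEMPLATE.format(context=context, question=question))
    # Pass 2: answer using only the regenerated context, never the original.
    return generate(ANSWER_TEMPLATE.format(context=cleaned, question=question))
```

The point of the design is that the second call never sees the original context, so repeated or prominent but irrelevant material cannot pull attention during answering. Whether the rewrite itself is faithful depends entirely on the model behind `generate`.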

Sources 5 notes

Does transformer attention architecture inherently favor repeated content?

Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.

How does instruction density affect model performance?

IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models maintain ~150 instructions then fail steeply). Even best models reach only 68% accuracy at maximum density.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and the instruction-tuned model's 43% is barely above a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
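The output-level variant can be pictured as a data-construction step. This is a minimal sketch under stated assumptions: `generate` and `wrap` are hypothetical callables, the function name and dictionary keys are invented, and the activation-level method is not shown because it operates on internal activations rather than text.

```python
from typing import Callable, Dict, List


def build_consistency_pairs(generate: Callable[[str], str],
                            clean_prompts: List[str],
                            wrap: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's own answer to the clean prompt.

    `generate` stands in for the current model; `wrap` stands in for whatever
    adds the irrelevant or adversarial padding around a clean prompt.
    """
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)  # the model's own clean response is the target
        pairs.append({"prompt": wrap(prompt), "target": target})
    return pairs
```

Fine-tuning on pairs like these would push the model to give its clean answer even when the prompt is padded. Because the targets come from the current model rather than a fixed dataset, they do not go stale in the way the note describes for standard SFT.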

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Are Emergent Abilities in Large Language Models just In-Context Learning? (3.32 match)
Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs (2.58 match)
Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning (1.71 match)
A Survey on Post-training of Large Language Models (1.69 match)
Consistency Training Helps Stop Sycophancy and Jailbreaks (0.92 match)
Exploring Format Consistency for Instruction Tuning (0.87 match)
How Many Instructions Can LLMs Follow at Once? (0.86 match)
LESS: Selecting Influential Data for Targeted Instruction Tuning (0.86 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher re-testing claims about primacy effects and instruction density. The question remains open: Why do primacy effects peak at specific instruction densities?

What a curated library found — and when (dated claims, not current truth):
These findings span 2023–2026 and should be treated as perishable:
• Transformer attention is structurally biased toward contextually prominent tokens (early instructions anchor context and accumulate attention on every step) — architectural, not learned (~2023–2024).
• Instruction-following degrades in distinct nonlinear *patterns* by model scale: linear for small, exponential for mid-range, threshold-decay (~150 instructions, then steep collapse) for reasoning models (~2025).
• Models trained on semantically empty or wrong instructions perform nearly as well as those on correct ones; what transfers is output *format distribution*, not task content (~2023).
• Cognitive biases in LLMs originate mainly in pretraining and are only modulated, not erased, by finetuning (~2025).
• System 2 Attention and consistency training both flatten position-density curves by stripping irrelevant context or teaching prompt-perturbation invariance (~2023–2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.11383 (2023) — foundational on instruction-tuning format learning
• arXiv:2507.11538 (2025) — empirical threshold curves across model scales
• arXiv:2507.07186 (2025) — pretraining origins of cognitive biases
• arXiv:2510.27062 (2025) — consistency training as intervention

Your task:
(1) RE-TEST EACH CONSTRAINT. For the threshold-decay curve (~150 instructions), examine whether recent model scaling, architectural changes (e.g., sliding-window attention, sparse transformers), or new training regimes (mixture-of-experts, continual learning) have *relaxed* the breaking point. Judge whether the format-learning finding still dominates over semantic understanding in the latest instruction-tuned models. Test whether consistency training's gains persist or vanish under adversarial prompting or scale.
(2) Surface the strongest *contradicting* or *superseding* work from the last ~6 months. Look for papers showing primacy effects *dissolving* under certain architectures, or threshold curves that don't appear, or evidence that semantic content now drives instruction-following more than position.
(3) Propose two research questions that *assume* the regime may have moved: (a) Do modern reasoning models (o1, Gemini 2.0) exhibit threshold-decay at all, or has scaling pushed breaking points beyond practical prompt lengths? (b) Can a model learn to weight instructions by *explicit salience markers* rather than position, and does that flatten or sharpen the primacy peak?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
