Why does attention concentrate on the first 25% of long input sequences?
This explores why transformer attention piles up on the earliest tokens of a long input — the 'attention sink' phenomenon — and what mechanisms in the architecture produce that front-loading.
The corpus points less at a single cause than at a stack of structural pressures baked into how attention works. The cleanest mechanical answer comes from the discovery that a tiny handful of input-agnostic 'massive activations', with values up to 100,000× larger than their neighbors, act as implicit attention bias terms, concentrating attention probability on a few fixed positions regardless of content (see: Do hidden massive activations act as attention bias terms?). Because softmax weights must sum to one, attention has to land somewhere, and these tokens (often the very first ones) serve as a default reservoir. Attention therefore sinks toward the front as a product of the mechanism, not because the early tokens are actually the most relevant.
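The "softmax has to put its weight somewhere" point can be made concrete with a toy example. This is a minimal sketch, not a reproduction of the massive-activations analysis: the eight positions, the equal content scores, and the bias value of 4.0 on position 0 are all assumptions chosen for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query over 8 key positions.
# All content scores are equal: no token is more relevant than another.
content_scores = [0.0] * 8

# Assumed content-independent bias on position 0, standing in for the
# effect of a massive activation. The value 4.0 is illustrative only.
sink_bias = [4.0] + [0.0] * 7

without_sink = softmax(content_scores)
with_sink = softmax([c + b for c, b in zip(content_scores, sink_bias)])

print("position 0 weight, no bias:  ", round(without_sink[0], 3))
print("position 0 weight, with bias:", round(with_sink[0], 3))
```

Under these assumed numbers, position 0 moves from a uniform one-eighth share to holding most of the attention mass, even though its content score never changed.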
That front-loading is reinforced by a second bias: soft attention systematically over-weights tokens that are context-prominent and repeated, regardless of relevance, creating a positive feedback loop that amplifies whatever appeared early or often (see: Does transformer attention architecture inherently favor repeated content?). Early tokens remain visible to every later position, so their prominence compounds across the sequence: the longer the input, the more the opening establishes framing that later content struggles to dislodge. The same note links this mechanism to sycophancy, and reports that regenerating the context to strip irrelevant material ('System 2 Attention') can interrupt it.
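One ingredient of that repetition bias can be seen in plain softmax arithmetic: weights are assigned per position, so copies of the same content pool their mass. The sketch below is a toy under stated assumptions (every position gets the same logit, and `share_of_repeated` is a hypothetical helper); it illustrates pooling only, not the full feedback loop the note describes.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def share_of_repeated(n_repeats, n_other, score=1.0):
    """Total attention mass landing on copies of one repeated token.

    Every position receives the same logit, so no single copy is
    scored as more relevant than any other token in the context.
    """
    logits = [score] * (n_repeats + n_other)
    weights = softmax(logits)
    return sum(weights[:n_repeats])

# Hold 8 unrelated tokens fixed and vary how often one token repeats.
for k in (1, 2, 4, 8):
    print(k, "copies ->", round(share_of_repeated(k, n_other=8), 3))
```

With identical per-token scores, the repeated content's combined share grows with the number of copies, which is the sense in which repetition alone buys attention.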
The consequence is that the back 75% of a long input is effectively under-served, which is consistent with reasoning accuracy degrading sharply well before the context window is full: on FLenQA, accuracy drops from 92% to 68% with just 3,000 tokens of padding, a degradation that is task-agnostic and survives chain-of-thought prompting (see: Does reasoning ability actually degrade with longer inputs?). If attention cannot distribute itself evenly across positions, more tokens means more dilution of the parts that are not sitting in the sink. The conversational version of the same failure is the 'wrong turn' problem: models lock onto early guesses and cannot course-correct when information arrives gradually (see: Why do AI assistants get worse at longer conversations?). This is premature commitment to the front of the sequence, again.
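The dilution argument can be sketched with the same softmax arithmetic. This is an illustration under assumed numbers, not a model of the FLenQA results: `relevant_weight` is a hypothetical helper, and the logits for the sink (4.0), the relevant token (2.0), and the padding (0.0) are arbitrary.

```python
import math

def relevant_weight(n_padding, relevant_logit=2.0, sink_logit=4.0):
    """Softmax weight on one relevant token when a sink token and
    n_padding zero-logit filler tokens compete for the same mass."""
    denom = (
        math.exp(sink_logit)            # the fixed sink position
        + math.exp(relevant_logit)      # the one token that matters
        + n_padding * math.exp(0.0)     # irrelevant padding tokens
    )
    return math.exp(relevant_logit) / denom

# The relevant token's logit never changes; only the padding grows.
for n in (0, 100, 1000, 3000):
    print(n, "padding tokens ->", round(relevant_weight(n), 4))
```

Because the denominator grows with every added token, the relevant token's share shrinks as padding increases even though its own score is unchanged.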
What's worth knowing is that the field treats this as an architectural limit to route around rather than a bug to patch. Titans separates short-term quadratic attention from a long-term neural memory that prioritizes surprising tokens for storage, scaling past 2M tokens because it stops asking attention alone to carry the whole sequence (see: Can neural memory modules scale language models beyond attention limits?). TransformerFAM instead adds a feedback loop that lets a transformer attend to its own latent representations, growing an emergent working memory for indefinitely long inputs without new weights (see: Can models learn working memory by attending to their own latents?). Both are tacit admissions that vanilla attention's gravitational pull toward the opening tokens is the thing you have to engineer against, and that the 'first 25%' is a property of the mechanism rather than of the data.
Sources (6 notes)
A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
LLMs perform at 90% accuracy with single-message instructions but drop to 65% across natural conversation. Models lock into early guesses when information arrives gradually and cannot course-correct, a behavior induced by RLHF training that rewards helpfulness over clarification.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.