Why does attention excel at context retrieval but struggle with state updates?
This explores a tradeoff baked into attention: it's superb at reaching back and pulling a fact out of context, but doesn't easily fold that context into a compact, updatable internal state — and the corpus suggests the two abilities pull in opposite directions.
This reads the question as asking why the same mechanism that makes transformers great at "find the needle in this haystack" is bad at "keep a running summary and revise it": retrieval vs. state-keeping. The corpus has a sharp answer to half of it: attention excels at retrieval precisely *because* it doesn't compress. A transformer keeps every past token addressable, so even a two-layer model can copy exponentially long strings, while state-space models, which fold the past into a fixed-size latent, are fundamentally limited, because a finite state means lossy compression of history (Can state-space models match transformers at copying and retrieval?). Retrieval is cheap when nothing is thrown away. The work even localizes this to a tiny subset of heads, under 5%, that act as dedicated retrieval machinery; prune them and the model hallucinates even though the answer sits right there in context (What mechanism enables models to retrieve from long context?).
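The contrast between an addressable history and a fixed-size state can be sketched in a few lines. This is a toy illustration, not any paper's implementation: `attention_lookup`, `fixed_state_summary`, the one-hot keys, the `scale` factor, and the scalar running-average "state" are all assumptions chosen to keep the example small. Real state-space models carry far larger states, but the same counting argument applies once the history outgrows the state.

```python
import math

def one_hot(i, n):
    """Hypothetical helper: a length-n key that is 1.0 at position i."""
    return [1.0 if j == i else 0.0 for j in range(n)]

def attention_lookup(query, keys, values, scale=20.0):
    """Softmax attention over every stored (key, value) pair.

    Nothing is discarded: each past item stays individually addressable.
    """
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w / total * v for w, v in zip(weights, values))

def fixed_state_summary(history, decay=0.5):
    """Fold a whole history into ONE number (a toy fixed-size state)."""
    state = 0.0
    for x in history:
        state = decay * state + (1.0 - decay) * x
    return state

n = 8
keys = [one_hot(i, n) for i in range(n)]
values = [float(10 * i) for i in range(n)]

# Attention recovers the value stored at position 3 almost exactly.
retrieved = attention_lookup(one_hot(3, n), keys, values)

# Two different histories collapse to the same fixed-size state,
# so no readout from that state can tell them apart.
state_a = fixed_state_summary([4.0, 2.0])
state_b = fixed_state_summary([8.0, 0.0])
```

Here `retrieved` lands within rounding distance of 30.0, while `state_a` and `state_b` are identical even though the underlying histories differ. That collision is the lossy compression the paragraph above describes.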
The flip side, struggling with state updates, is the cost of that same design. Keeping everything addressable means there's no compact, mutable "working state" that gets rewritten as new information arrives; instead the model re-derives what it needs from the raw context at each step. One striking reframing in the corpus is that the long-context bottleneck isn't memory capacity at all but the *compute* needed to consolidate evicted context into fast weights, that is, to turn what was read into internal state. That consolidation behaves like a test-time scaling problem: performance improves with additional consolidation passes (Is long-context bottleneck really about memory or compute?). Updating state is expensive work that attention doesn't natively do.
That's why a wave of architectures bolts a separate memory onto attention rather than asking attention to do both jobs. Titans splits the system explicitly: attention handles short-term, exact lookup, while a neural memory module compresses and *updates* long-term state by prioritizing surprising tokens for storage, which lets it scale past 2M tokens without the quadratic penalty (Can neural memory modules scale language models beyond attention limits?). The architecture is essentially an admission that retrieval and stateful update are different problems wanting different mechanisms.
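To make "writing in surprising tokens" concrete, here is a minimal sketch of a surprise-driven memory write. It is loosely inspired by the idea described above and is not the Titans implementation: the `LinearMemory` class, its linear key-to-value map, the `lr` parameter, and the use of prediction error as the surprise signal are all simplifying assumptions, and features such as momentum or forgetting are left out.

```python
import math

class LinearMemory:
    """Hypothetical fixed-size associative memory: value ~ M @ key."""

    def __init__(self, dim, lr=1.0):
        self.M = [[0.0] * dim for _ in range(dim)]
        self.lr = lr

    def read(self, key):
        """Predict the value the memory currently associates with key."""
        return [sum(m * k for m, k in zip(row, key)) for row in self.M]

    def write(self, key, value):
        """One gradient step on 0.5 * ||M @ key - value||^2.

        Returns the surprise, i.e. the size of the prediction error
        measured before the update is applied.
        """
        pred = self.read(key)
        err = [v - p for v, p in zip(value, pred)]
        surprise = math.sqrt(sum(e * e for e in err))
        for i, e in enumerate(err):
            for j, k in enumerate(key):
                self.M[i][j] += self.lr * e * k
        return surprise

memory = LinearMemory(dim=2)
key = [1.0, 0.0]

first = memory.write(key, [3.0, 4.0])    # unseen association: large surprise
second = memory.write(key, [3.0, 4.0])   # already stored: no surprise
revised = memory.write(key, [0.0, 0.0])  # the fact changed: surprise again
current = memory.read(key)               # the state now holds the new value
```

With a unit-norm key and `lr=1.0`, one write stores the association exactly, so the repeated write carries no surprise and changes nothing. The third write overwrites the old value in place, which is the kind of compact, revisable state that plain attention does not maintain.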
There's a deeper hint about *why* updates are hard, not just expensive. Even when the right information is sitting in context, models often fail to integrate it because their trained-in parametric associations override the in-context signal, and text prompting alone can't fix it; you have to intervene in the representations (Why do language models ignore information in their context?). So "state update" fails on two fronts at once: there's no native mechanism to maintain a revisable state, and the prior weights resist being overruled by new context. Related work shows the model's state is also load-bearing in odd places: a handful of input-agnostic "massive activations" act as implicit attention biases that concentrate attention probability onto certain tokens (Do hidden massive activations act as attention bias terms?). It is a reminder that what passes for internal state in a transformer is improvised, not designed for clean updates.
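The "implicit attention bias" effect is easy to see in isolation. The sketch below assumes a single attention row with made-up logits and a made-up constant bias of 6.0 on the first token; it only shows how an input-agnostic additive term reshapes a softmax, not how massive activations arise inside a real model.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical content-dependent attention logits for four tokens.
content_logits = [0.5, 1.0, 0.2, 0.8]

# Hypothetical input-agnostic bias: the same large constant is added to
# token 0 regardless of what the query is looking for.
bias = [6.0, 0.0, 0.0, 0.0]

plain = softmax(content_logits)
biased = softmax([l + b for l, b in zip(content_logits, bias)])
```

Without the bias, token 0 receives roughly a fifth of the attention mass; with it, token 0 absorbs almost all of it, whatever the content logits say.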
The thing you didn't know you wanted to know: retrieval and state-update aren't two features on a spectrum — they're a genuine architectural fork. The property that makes attention a perfect lookup table (lossless, fully addressable history) is the exact property that denies it a compact updatable state, and the latest designs don't "fix" attention so much as pair it with a second, compressing memory that does the updating it can't.
Sources (6 notes)
Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.
Fewer than 5% of attention heads across all model families function as retrieval heads. They are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination despite the information being present in context.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.