Which attention heads are essential for maintaining factuality in sparse models?
This explores whether specific attention heads—rather than the whole network—carry the burden of factual recall, and what happens to them when models lean on sparse attention to handle long contexts.
The corpus has a sharp answer: a small minority of heads does almost all the work. Research on retrieval heads finds that fewer than 5% of attention heads, across all the model families studied, act as the mechanism that pulls a fact out of long context and into the answer, and they are causally necessary, not just correlated. Prune them and the model hallucinates even when the correct information is sitting right there in the context window (What mechanism enables models to retrieve from long context?). So the question's premise holds: factuality in a long-context model rests on a sparse, identifiable scaffold of heads, and knowing which heads those are tells you where recall is most likely to break.
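One way to make "identifying the heads that matter" concrete is a per-head copy score on a needle-in-a-haystack probe. The sketch below is a simplified illustration, not the exact procedure from the retrieval-head research: the function name `retrieval_score`, the toy attention rows, and the half-open needle span are all assumptions made for the example. A head earns credit for a needle token when, at the step that token is generated, the head's most-attended context position lies inside the needle and holds that same token.

```python
def retrieval_score(attn_per_step, context_tokens, generated_tokens, needle_span):
    """Fraction of needle positions this head copies into the output.

    attn_per_step[i] is the head's attention over context positions at the
    step that produced generated_tokens[i]. needle_span is half-open.
    """
    start, end = needle_span
    copied = set()
    for weights, token in zip(attn_per_step, generated_tokens):
        # Position this head attends to most strongly at this decoding step.
        top = max(range(len(weights)), key=weights.__getitem__)
        if start <= top < end and context_tokens[top] == token:
            copied.add(top)
    return len(copied) / (end - start)


# Toy data (hypothetical): the needle "7 4 2" sits at positions 3..5.
context = ["the", "code", "is", "7", "4", "2", "ok"]
generated = ["7", "4", "2"]
needle = (3, 6)

# Head A tracks the needle token by token; head B stares at position 0.
head_a = [
    [0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.9, 0.0],
]
head_b = [[0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]] * 3

score_a = retrieval_score(head_a, context, generated, needle)
score_b = retrieval_score(head_b, context, generated, needle)
```

In this toy setup head A scores 1.0 and head B scores 0.0. Over many probes, heads whose average score clears some threshold would be the candidates to protect; the threshold itself is a tuning choice the notes do not specify.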
The interesting twist is that these retrieval heads are themselves a *sparse* mechanism, which reframes what 'sparse attention' is doing. The Sparse Frontier work shows sparse-attention models aren't simply trading quality for speed: at equal compute, a bigger sparse model beats a smaller dense one on long-context tasks (Does sparse attention trade off quality for speed?). Read alongside the retrieval-head finding, the reason becomes intuitive: if only a sliver of heads is essential for factual recall anyway, then aggressively sparsifying attention is cheap *as long as you don't prune the heads that matter*. The danger isn't sparsity itself; it's blind sparsity that severs the retrieval scaffold.
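The "sparsify, but not blindly" idea can be sketched as a per-head masking policy. This is a hypothetical policy for illustration, not the method evaluated in the Sparse Frontier work: the names `causal_mask`, `head_aware_masks`, and `kept_fraction`, the sliding-window choice, and the protected head id are all assumptions. Protected heads keep full causal attention, while every other head is restricted to a short local window.

```python
def causal_mask(seq_len, window=None):
    """Boolean mask[q][k]: may query q attend to key k?

    window=None means full causal attention; otherwise only the most
    recent `window` keys (including the query position) are visible.
    """
    return [
        [(k <= q) and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def head_aware_masks(num_heads, seq_len, window, protected_heads):
    """Full causal mask for protected heads, sliding window for the rest."""
    return [
        causal_mask(seq_len, None if h in protected_heads else window)
        for h in range(num_heads)
    ]


def kept_fraction(masks):
    """Share of dense causal attention entries that survive sparsification."""
    seq_len = len(masks[0])
    dense = len(masks) * seq_len * (seq_len + 1) // 2
    kept = sum(cell for mask in masks for row in mask for cell in row)
    return kept / dense


# Hypothetical setup: 4 heads, 8 tokens, window of 2, head 1 is protected.
masks = head_aware_masks(num_heads=4, seq_len=8, window=2, protected_heads={1})
fraction = kept_fraction(masks)
```

Here the protected head can still reach the first token from the last position, the windowed heads cannot, and about 56% of the dense causal entries remain. At realistic sequence lengths the windowed heads cost far less relative to dense attention, so keeping a handful of heads dense adds little to the total.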
There's also a deeper pattern: sparsity in these models seems to be where the model signals difficulty and unfamiliarity. Hidden states sparsify in a systematic, localized way precisely when a task is out-of-distribution, acting as a stabilizing filter rather than a failure (Do language models sparsify their activations under difficult tasks?), and that sparse-when-unfamiliar / dense-when-familiar split is learned during pretraining as the model consolidates what it actually knows (Is representational sparsity learned or intrinsic to neural networks?). The takeaway for factuality: a model's representations are dense where it has knowledge and sparse where it's reaching, so, by extension, the heads that stay active under pressure may serve as a rough map of what the model can reliably retrieve.
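To compare "dense where familiar, sparse where reaching" across inputs, you need a number for how sparse a hidden-state vector is. The notes do not say which metric the underlying studies use, so the two measures below are assumptions chosen for illustration: the Hoyer measure (0 for a perfectly uniform vector, 1 for a one-hot vector) and a plain near-zero fraction with a hypothetical tolerance `eps`.

```python
import math


def hoyer_sparsity(vec):
    """Hoyer measure: (sqrt(n) - L1/L2) / (sqrt(n) - 1), in [0, 1]."""
    n = len(vec)
    l1 = sum(abs(x) for x in vec)
    l2 = math.sqrt(sum(x * x for x in vec))
    if n < 2 or l2 == 0.0:
        # Undefined for these inputs; return 0.0 by convention in this sketch.
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def near_zero_fraction(vec, eps=1e-3):
    """Share of entries whose magnitude falls below eps."""
    return sum(abs(x) < eps for x in vec) / len(vec)


# Toy hidden states (hypothetical values).
familiar = [1.0, 1.0, 1.0, 1.0]    # activation spread evenly
unfamiliar = [0.0, 0.0, 3.0, 0.0]  # activation concentrated in one unit

dense_score = hoyer_sparsity(familiar)
sparse_score = hoyer_sparsity(unfamiliar)
```

The evenly spread vector scores 0 and the concentrated one scores 1. Tracking a score like this per layer, on in-distribution versus out-of-distribution prompts, is one way the sparse-when-unfamiliar pattern could be checked on a given model.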
Finally, the corpus suggests an architectural escape hatch when the head-based mechanism hits its limits. Rather than overloading attention with long-range recall, some designs split memory off entirely: Titans gives the model a separate neural memory module that adaptively stores 'surprising' tokens, letting attention stay short-range while a dedicated long-term store handles recall past two million tokens (Can neural memory modules scale language models beyond attention limits?). The same complementary instinct shows up in pairing O(1) lookup memory with sparse expert routing, where balancing the two beats either alone (Can lookup memory and computation work together better than either alone?). The throughline you might not have expected: factuality isn't a property of the whole network. It's concentrated in a few heads or offloaded to a dedicated memory, and the design question is whether you protect those heads or build something separate to do their job.
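The "store only what is surprising" idea can be shown with a deliberately tiny key-value memory. This is a toy analogy under stated assumptions, not the Titans architecture: the class name `SurpriseGatedMemory`, the scalar values, the absolute-error surprise signal, and the fixed threshold are all invented for the example, and the real module is a learned neural memory rather than a dictionary.

```python
class SurpriseGatedMemory:
    """Toy long-term store that writes only when its prediction is far off."""

    def __init__(self, threshold, lr=1.0):
        self.threshold = threshold  # minimum surprise that triggers a write
        self.lr = lr                # how far a write moves toward the new value
        self.store = {}
        self.writes = 0

    def read(self, key):
        # Unknown keys read as 0.0, so first sightings tend to be surprising.
        return self.store.get(key, 0.0)

    def observe(self, key, value):
        """Return the surprise; update the store only if it exceeds the threshold."""
        predicted = self.read(key)
        surprise = abs(value - predicted)
        if surprise > self.threshold:
            self.store[key] = predicted + self.lr * (value - predicted)
            self.writes += 1
        return surprise


# Hypothetical token stream: a repeat of "a" is close to what is stored.
memory = SurpriseGatedMemory(threshold=0.5)
memory.observe("a", 1.0)  # new and surprising: written
memory.observe("a", 1.2)  # close to the stored value: skipped
memory.observe("b", 3.0)  # new and surprising: written
```

After these three observations the memory has performed two writes. The point of the gate is capacity: a store that skips predictable input can spend its budget on the tokens that attention alone would be most likely to lose over a very long context.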
Sources (6 notes)
Fewer than 5% of attention heads across all model families function as retrieval heads. They are already present in models pretrained on short contexts, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination despite the information being present in context.
The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.