How does disentangled attention separate text from spatial reasoning?
This reads as a question about architectures such as DeBERTa that split attention into separate channels, one for what a token says and another for where it sits. Disentangled attention is the idea that a model can compute attention over a token's content separately from its position, so spatial relationships don't get tangled up with meaning. The collection doesn't contain a paper on that exact mechanism, so rather than pad, here's the adjacent territory it covers well: how attention inside a transformer pulls apart into specialized roles, which is the same instinct disentangled attention is built on.
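For readers who want the mechanism itself, here is a minimal sketch of how a DeBERTa-style attention score can be assembled from separate content and relative-position projections. It is written from the general published description of the idea, not taken from the collection or from any DeBERTa codebase. The function names (`disentangled_scores`, `rel_bucket`, `random_vectors`), the toy dimensions, and the random vectors are all assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rel_bucket(i, j, k):
    """Map the relative distance i - j into a bucket index in [0, 2k)."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def disentangled_scores(qc, kc, qr, kr, k):
    """Pre-softmax attention scores with content and position kept separate.

    qc, kc: per-token content queries/keys (n vectors of size d).
    qr, kr: relative-position queries/keys (2k vectors of size d).
    """
    n, d = len(qc), len(qc[0])
    scale = math.sqrt(3 * d)  # three summed terms, so scale by sqrt(3d)
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            c2c = dot(qc[i], kc[j])                    # content -> content
            c2p = dot(qc[i], kr[rel_bucket(i, j, k)])  # content -> position
            p2c = dot(kc[j], qr[rel_bucket(j, i, k)])  # position -> content
            row.append((c2c + c2p + p2c) / scale)
        scores.append(row)
    return scores


def random_vectors(count, d, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(count)]


rng = random.Random(0)
n, d, k = 4, 3, 2  # toy sizes, chosen only for illustration
qc, kc = random_vectors(n, d, rng), random_vectors(n, d, rng)
qr, kr = random_vectors(2 * k, d, rng), random_vectors(2 * k, d, rng)
scores = disentangled_scores(qc, kc, qr, kr, k)
```

Zeroing `qr` and `kr` leaves only the content-to-content term, which is one way to see that the positional signal travels through its own pathway.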
The strongest parallel is the finding that only a tiny slice of attention heads, under 5%, act as dedicated 'retrieval heads' that fetch facts from long context, and that pruning them causes hallucination even when the information is sitting right there (see "What mechanism enables models to retrieve from long context?"). That's disentanglement discovered after the fact: the model spontaneously dedicates specific circuits to a specific job. A related result shows memorized text is handled by a particular low-layer attention head that locks onto rare tokens (see "Where does a model store memorized paragraphs?"). In both cases attention isn't one undifferentiated soup: different heads quietly specialize.
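To make "retrieval head" concrete, here is a toy sketch of one way such heads could be flagged from recorded attention. The scoring rule (how often a head's top-attended position falls inside the span holding the fact), the 0.5 threshold, the head labels, and the attention values are all assumptions for illustration; the collection's note states only that such heads exist and are rare, not how they were measured.

```python
def retrieval_score(attn_rows, needle_positions):
    """Fraction of decoding steps where the head's top-attended
    position lies inside the needle (the span holding the fact)."""
    if not attn_rows:
        return 0.0
    hits = 0
    for row in attn_rows:
        top = max(range(len(row)), key=row.__getitem__)
        if top in needle_positions:
            hits += 1
    return hits / len(attn_rows)


def find_retrieval_heads(traces, needle_positions, threshold=0.5):
    """Return head labels whose score meets a (hypothetical) threshold."""
    return sorted(
        head
        for head, rows in traces.items()
        if retrieval_score(rows, needle_positions) >= threshold
    )


# Made-up attention rows: one row per generated token, one weight per context position.
needle = {3, 4}
traces = {
    "L2.H1": [[0.7, 0.1, 0.1, 0.05, 0.05], [0.6, 0.2, 0.1, 0.05, 0.05]],
    "L9.H4": [[0.05, 0.05, 0.1, 0.7, 0.1], [0.05, 0.05, 0.1, 0.1, 0.7]],
}
retrieval_heads = find_retrieval_heads(traces, needle)
```

In this toy data only the second head keeps its peak attention on the needle, so it is the only one flagged.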
Where DeBERTa separates channels by design, the Titans architecture separates them at the module level: short-term attention (quadratic in cost) is split off from a long-term neural memory that compresses and stores 'surprising' tokens, letting the model scale past two million tokens of context (see "Can neural memory modules scale language models beyond attention limits?"). That's the same move as disentanglement: give distinct functions distinct machinery rather than forcing one attention mechanism to do everything.
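The note describes the long-term memory only as storing 'surprising' tokens. One way to make that concrete, used here as an assumption, is to treat the memory as a small associative map and measure surprise as the size of the gradient of its recall error. The sketch below simplifies the memory to a linear map in pure Python. The momentum and forgetting terms, the hyperparameter values, and the function names are illustrative and are not taken from the collection.

```python
import math


def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def memory_step(M, S, key, value, lr=0.1, momentum=0.9, forget=0.01):
    """One write into a linear associative memory M (value ~ M @ key).

    Returns the updated memory, the updated momentum buffer, and the surprise.
    """
    err = [p - v for p, v in zip(matvec(M, key), value)]
    # Gradient of ||M @ key - value||^2 with respect to M.
    grad = [[2.0 * e * kj for kj in key] for e in err]
    surprise = math.sqrt(sum(g * g for row in grad for g in row))
    # Momentum carries past surprise forward; the gradient adds the new surprise.
    S = [[momentum * s - lr * g for s, g in zip(s_row, g_row)]
         for s_row, g_row in zip(S, grad)]
    # Forgetting decays old content before the update is applied.
    M = [[(1.0 - forget) * m + s for m, s in zip(m_row, s_row)]
         for m_row, s_row in zip(M, S)]
    return M, S, surprise


dim = 2
M = [[0.0] * dim for _ in range(dim)]
S = [[0.0] * dim for _ in range(dim)]
key, value = [1.0, 0.0], [1.0, 0.0]
M, S, first = memory_step(M, S, key, value)
M, S, second = memory_step(M, S, key, value)
```

In this toy run the second presentation of the same pair produces a smaller surprise than the first, while a pair the memory has never seen would still register as fully surprising.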
The collection also explains *why* you'd want to disentangle in the first place. Plain soft attention is structurally biased toward whatever is repeated or prominent in the context, regardless of whether it's relevant, which creates a feedback loop that amplifies framing and feeds sycophancy (see "Does transformer attention architecture inherently favor repeated content?"). And a handful of 'massive activations' act as fixed, input-agnostic bias terms that dump attention onto particular tokens (see "Do hidden massive activations act as attention bias terms?"). Both are cases of attention failing to cleanly separate signal from artifact, which is exactly the failure disentangled designs try to engineer away.
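The repetition bias has a simple arithmetic core that can be checked directly: softmax normalizes over positions, not over distinct pieces of content, so copies of the same token pool their weight. The sketch below is a toy calculation under the assumption that every copy receives the same raw score. The function names are hypothetical, and it does not reproduce the experiments behind the cited note.

```python
import math


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mass_on_repeated(repeated_score, other_scores, repeats):
    """Total attention weight landing on `repeats` copies of one token."""
    weights = softmax([repeated_score] * repeats + list(other_scores))
    return sum(weights[:repeats])


others = [0.0, 0.0, 0.0]  # three distinct tokens with the same raw score
once = mass_on_repeated(0.0, others, repeats=1)
thrice = mass_on_repeated(0.0, others, repeats=3)
```

With equal raw scores, one copy among four positions draws a quarter of the attention, while three copies among six positions draw half, even though the repeated token is no more relevant than before.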
So if your real interest is text-versus-spatial specifically, the corpus can't take you there. But if it's the broader question — *can attention be made to handle distinct kinds of information through distinct pathways?* — the answer the collection keeps returning is yes, sometimes by design (Titans) and sometimes as an emergent property worth finding and protecting (retrieval heads).
Sources (5 notes)
What mechanism enables models to retrieve from long context?
Less than 5% of attention heads across all model families function as retrieval heads, are intrinsic to short-context models, dynamically activate by context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.
Where does a model store memorized paragraphs?
Memorized paragraphs leave a distinctive fingerprint in GPT-Neo: larger gradients in lower layers, concentration in a specific low-layer attention head attending to rare tokens, and dependence on a few early-prefix tokens. This localization makes memorization targetable for unlearning.
Can neural memory modules scale language models beyond attention limits?
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Does transformer attention architecture inherently favor repeated content?
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention, which regenerates the context to remove irrelevant material, can interrupt this mechanism.
Do hidden massive activations act as attention bias terms?
A very small number of input-agnostic activations with values up to 100,000× larger than others act as indispensable implicit bias terms and concentrate attention probability onto specific tokens. This phenomenon appears across model sizes and Vision Transformers.