How do attention heads separate text retrieval from internal thought representation?
This explores whether the transformer separates two jobs — pulling facts out of the surrounding text versus carrying its own reasoning forward — into different parts of the attention machinery, and what the corpus knows about that division of labor.
The corpus suggests the answer is yes, but the boundary is messier and more revealing than a clean split. The clearest evidence for separation comes from retrieval heads: fewer than 5% of all attention heads, a share that holds across model families, do the work of copying a fact from long context into the answer. They are causally necessary for factuality: prune them and the model hallucinates even though the information is sitting right there in the prompt (see "What mechanism enables models to retrieve from long context?"). So 'text retrieval' isn't smeared across the whole network; it's concentrated in a sparse, identifiable set of heads that activate depending on context. That's the retrieval side of your question made concrete.
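To make "identifiable set of heads" concrete, here is a minimal sketch of how a per-head retrieval score could be computed. It assumes you have already recorded, for every decoding step that copies a token from the context, which context position each head attended to most strongly. The names `retrieval_scores` and `select_retrieval_heads`, the input layout, and the threshold are illustrative assumptions, not the procedure from the cited note.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # (layer index, head index); hypothetical layout


def retrieval_scores(
    top_attended: Dict[Head, List[int]],
    copied_from: List[int],
) -> Dict[Head, float]:
    """Fraction of copy steps at which a head's most-attended context
    position equals the position the output token was copied from."""
    if not copied_from:
        raise ValueError("need at least one copy step")
    scores: Dict[Head, float] = {}
    for head, positions in top_attended.items():
        if len(positions) != len(copied_from):
            raise ValueError(f"head {head} has the wrong number of steps")
        hits = sum(p == c for p, c in zip(positions, copied_from))
        scores[head] = hits / len(copied_from)
    return scores


def select_retrieval_heads(
    scores: Dict[Head, float], threshold: float = 0.5
) -> List[Head]:
    """Heads whose score meets the (assumed) threshold, in sorted order."""
    return sorted(h for h, s in scores.items() if s >= threshold)


# Toy data: four decoding steps that copy context positions 10..13.
copied_from = [10, 11, 12, 13]
top_attended = {
    (0, 0): [0, 0, 0, 0],      # always attends to the first token
    (5, 2): [10, 11, 12, 13],  # tracks the copied position at every step
    (7, 1): [10, 3, 12, 2],    # tracks it at half of the steps
}
scores = retrieval_scores(top_attended, copied_from)
heads = select_retrieval_heads(scores, threshold=0.75)
```

On the toy data only head `(5, 2)` clears the threshold. In a real model the attention maps would come from the model itself, and the threshold would need to be chosen empirically.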
The 'internal thought' side shows up with a different signature. Reasoning doesn't live in special heads so much as in special moments: particular tokens such as 'Wait' and 'Therefore', along with other reflection cues, spike in mutual information with the correct answer. Suppressing exactly those tokens damages reasoning, while suppressing the same number of random tokens does not (see "Do reflection tokens carry more information about correct answers?"). One way to read your question across these two notes: retrieval is a *who* (which heads), while internal thought is a *when* (which transition points carry the load). The model isn't routing both through one channel under two labels; the two have genuinely different fingerprints.
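As a rough illustration of what "mutual information with the correct answer" measures, the sketch below computes a plug-in estimate of mutual information between two binary indicators: whether a reflection cue appeared in a sample, and whether the final answer was correct. This is a deliberate simplification. The cited note refers to mutual information measured on the model's representations, and the function name and toy data here are assumptions for illustration only.

```python
import math
from typing import Dict, Sequence, Tuple


def mutual_information_bits(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")

    # Count joint and marginal occurrences.
    joint: Dict[Tuple[int, int], int] = {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
    px: Dict[int, int] = {}
    py: Dict[int, int] = {}
    for (x, y), c in joint.items():
        px[x] = px.get(x, 0) + c
        py[y] = py.get(y, 0) + c

    # Sum p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs.
    mi = 0.0
    for (x, y), c in joint.items():
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy data: 1 = cue present / answer correct, 0 = absent / incorrect.
cue_present = [1, 1, 0, 0]
correct_aligned = [1, 1, 0, 0]    # cue perfectly predicts correctness
correct_unrelated = [1, 0, 1, 0]  # cue tells us nothing

mi_aligned = mutual_information_bits(cue_present, correct_aligned)
mi_unrelated = mutual_information_bits(cue_present, correct_unrelated)
```

A cue that perfectly predicts correctness carries one full bit in this toy setup, while an unrelated cue carries none. A "spike" in the sense above would be a token position whose estimate sits well above that of its neighbors.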
But the separation leaks, and that's the part worth knowing. Chain-of-thought reasoning is routinely contaminated by a third thing, memorization, where the model leans on preceding tokens instead of actually reasoning; this 'local memorization' accounts for up to 67% of reasoning errors (see "Where do memorization errors arise in chain-of-thought reasoning?"). So the boundary between 'retrieving something I saw' and 'thinking it through' isn't policed cleanly; the retrieval-like pull of nearby tokens can masquerade as thought. Reinforcing that, soft attention is structurally biased to over-weight repeated and context-prominent tokens regardless of relevance (see "Does transformer attention architecture inherently favor repeated content?"). The same mechanism that makes retrieval work also makes the model grab loud surface material when it should be reasoning past it.
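The repetition bias follows directly from how softmax normalizes: every occurrence of a token gets its own share of the attention weight, so a repeated token accumulates mass. The sketch below shows this with hand-picked scores. It assumes each occurrence of a repeated token receives the same raw score, which ignores positional effects; the function names and numbers are illustrative.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_mass_by_token(tokens: List[str], scores: List[float]) -> Dict[str, float]:
    """Total attention weight received by each distinct token."""
    mass: Dict[str, float] = {}
    for tok, w in zip(tokens, softmax(scores)):
        mass[tok] = mass.get(tok, 0.0) + w
    return mass


tokens = ["paris", "rome", "rome", "rome"]

# Case 1: identical raw scores. The repeated token wins purely by count.
mass_equal = attention_mass_by_token(tokens, [1.0, 1.0, 1.0, 1.0])

# Case 2: the single relevant token scores higher per position,
# yet three lower-scoring repeats still collect more total mass.
mass_biased = attention_mass_by_token(tokens, [2.0, 1.0, 1.0, 1.0])
```

In the second case the relevant token has the highest individual score but still loses on aggregate, which is the kind of surface pull the paragraph above describes.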
The most interesting move in the corpus is architectural: some researchers stop trying to make one attention mechanism do both jobs and split them into separate components. The Titans design separates short-term attention from a long-term neural memory module that prioritizes surprising tokens for storage, scaling past 2M tokens of context without the quadratic cost (see "Can neural memory modules scale language models beyond attention limits?"). DeepRAG reframes the whole question as a decision the model makes step by step: at each reasoning step, either retrieve from outside or rely on internal parametric knowledge. It reports an improvement of roughly 22%, attributed to better-targeted retrieval and to skipping retrieval when internal knowledge suffices (see "When should language models retrieve external knowledge versus use internal knowledge?"). That's your question turned into an engineering choice: rather than discovering the separation inside the heads, build it in explicitly.
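The step-by-step choice between retrieving and relying on internal knowledge can be sketched as a simple loop. Note the substitution: DeepRAG learns this decision, whereas the sketch below uses a fixed confidence threshold as a stand-in. Every name here (`answer_stepwise`, `toy_parametric`, `toy_retrieve`, `min_confidence`) is hypothetical and the stubs return placeholder strings, not real model or retriever output.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    subquery: str
    answer: str
    used_retrieval: bool


def answer_stepwise(
    subqueries: List[str],
    parametric_answer: Callable[[str], Tuple[str, float]],
    retrieve: Callable[[str], str],
    min_confidence: float = 0.7,  # assumed threshold, not a learned policy
) -> List[Step]:
    """For each sub-question, keep the internal answer if it is confident
    enough; otherwise fall back to external retrieval."""
    trace: List[Step] = []
    for q in subqueries:
        answer, confidence = parametric_answer(q)
        if confidence >= min_confidence:
            trace.append(Step(q, answer, used_retrieval=False))
        else:
            trace.append(Step(q, retrieve(q), used_retrieval=True))
    return trace


# Toy stand-ins for a model's internal knowledge and an external retriever.
KNOWN = {"capital of France": ("Paris", 0.95)}


def toy_parametric(q: str) -> Tuple[str, float]:
    return KNOWN.get(q, ("unknown", 0.1))


def toy_retrieve(q: str) -> str:
    return f"[retrieved passage for: {q}]"


trace = answer_stepwise(
    ["capital of France", "population of Paris in 2023"],
    toy_parametric,
    toy_retrieve,
)
```

The first sub-question is answered internally and the second triggers retrieval. The point of the design is that skipping retrieval on confident steps keeps irrelevant external text out of the reasoning chain.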
The less obvious takeaway: the attention behavior that makes retrieval reliable is also what makes 'thinking' fragile. The model's tendency to over-attend to prominent, repeated, or nearby tokens is simultaneously the basis of good fact-copying and the source of the memorization errors that corrupt reasoning. Retrieval and internal thought aren't separated by a wall; on this reading they're two settings of one biased dial, which is exactly why so much recent work tries to give the model an explicit switch between them.
Sources (6 notes)
Less than 5% of attention heads across all model families function as retrieval heads, are intrinsic to short-context models, dynamically activate by context, and are causally necessary for factuality. Pruning them causes hallucination despite information being present in context.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Transformer soft attention systematically over-weights repeated and context-prominent tokens regardless of relevance, creating a positive feedback loop that amplifies opinions and framing before RLHF acts. System 2 Attention—regenerating context to remove irrelevant material—can interrupt this mechanism.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.