SYNTHESIS NOTE

What mechanism enables models to retrieve from long context?

Do attention heads specialize in retrieving relevant information from long context windows, and if so, what makes them universal across models and necessary for factual generation?

Synthesis note · 2026-02-23 · sourced from MechInterp

Across 4 model families, 6 scales, and 3 types of finetuning, a specific type of attention head, the retrieval head, is largely responsible for retrieving relevant information from arbitrary locations in long context. Retrieval heads have five key properties:

  1. Universal: All explored models with long-context capability have retrieval heads.
  2. Sparse: Less than 5% of attention heads are retrieval heads.
  3. Intrinsic: They already exist in models pretrained with short context. When continual pretraining extends the context to 32-128K tokens, the same set of heads performs retrieval; no new retrieval mechanisms emerge.
  4. Dynamically activated: In Llama-2 7B, 12 retrieval heads always attend to required information regardless of context changes; remaining retrieval heads activate selectively by context.
  5. Causal: Completely pruning retrieval heads causes hallucination; pruning random non-retrieval heads has no effect on retrieval ability.
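
The note does not say how a head is scored as a retrieval head. One plausible way to operationalize the idea, given here as a minimal sketch and not as the source's method, is to count how often a head's most-attended context position lies inside the needle and holds exactly the token the model generates at that step. The function names, the 0.1 threshold, and the input layout (one row of attention weights per decoding step) are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def retrieval_score(
    attn: Sequence[Sequence[float]],
    context_tokens: Sequence[str],
    generated_tokens: Sequence[str],
    needle_start: int,
    needle_end: int,
) -> float:
    """Fraction of needle positions that one head 'copies' while decoding.

    attn[t][j] is the head's attention weight from decoding step t to
    context position j (assumed layout). The needle occupies context
    positions needle_start (inclusive) to needle_end (exclusive).
    """
    copied = set()
    for step, row in enumerate(attn):
        # Context position this head attends to most at this step.
        top = max(range(len(row)), key=row.__getitem__)
        inside_needle = needle_start <= top < needle_end
        # Count a copy only if the attended token is the generated one.
        if inside_needle and context_tokens[top] == generated_tokens[step]:
            copied.add(top)
    needle_len = needle_end - needle_start
    return len(copied) / needle_len if needle_len > 0 else 0.0


def select_retrieval_heads(
    scores: Dict[str, float], threshold: float = 0.1
) -> List[str]:
    """Keep heads whose score reaches the (assumed) threshold."""
    return [head for head, score in scores.items() if score >= threshold]
```

With scores averaged over many needle placements, the sparsity claim corresponds to `select_retrieval_heads` returning only a small fraction of all heads, and the "always attend" heads in property 4 would be those whose score stays high for every context tested.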

The CoT connection: retrieval heads strongly influence chain-of-thought reasoning, where the model must frequently refer back to the question and previously-generated context. Tasks where the model directly generates from intrinsic knowledge are less impacted by retrieval head pruning.
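
The pruning comparison described above can be sketched as a head ablation with a matched random control. This is an illustrative sketch under stated assumptions, not the procedure from the source: heads are identified by plain integer indices, a pruned head is modeled by zeroing its output before the heads are concatenated, and the helper names `ablate_heads` and `matched_random_control` are hypothetical.

```python
import random
from typing import List, Sequence, Set


def ablate_heads(
    head_outputs: Sequence[Sequence[float]], pruned: Set[int]
) -> List[float]:
    """Concatenate per-head outputs, zeroing the heads listed in `pruned`."""
    combined: List[float] = []
    for head, vec in enumerate(head_outputs):
        if head in pruned:
            combined.extend([0.0] * len(vec))  # pruned head contributes nothing
        else:
            combined.extend(vec)
    return combined


def matched_random_control(
    num_heads: int, retrieval_heads: Set[int], seed: int = 0
) -> Set[int]:
    """Sample as many non-retrieval heads as there are retrieval heads.

    Pruning this control set gives the baseline against which pruning
    the retrieval heads themselves is compared.
    """
    candidates = [h for h in range(num_heads) if h not in retrieval_heads]
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    return set(rng.sample(candidates, len(retrieval_heads)))
```

Matching the number of pruned heads in both conditions keeps the comparison fair: any difference in hallucination rate is then attributable to which heads were removed, not to how many.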

This connects the factuality problem to the reasoning architecture. The related note "Why does reasoning training help math but hurt medical tasks?" describes layer-level separation. Retrieval heads describe head-level specialization within this architecture: a sparse subset of the attention mechanism bridges stored knowledge to ongoing generation.

The practical implication for RAG systems: retrieval heads explain why models can struggle with long-context retrieval despite having the information in context. If retrieval heads are partially activated or not activated for a given needle, the model hallucinates. This is a mechanistic explanation for the Needle-in-a-Haystack failure mode.

Original note title

retrieval heads are a universal sparse intrinsic mechanism for long-context factuality — pruning them causes hallucination