Can recurrent memory scale where attention fails on ultra-long text?

GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?

Synthesis note · 2026-06-03 · sourced from RAG

BABILong is a leak-proof benchmark for extracting and processing facts distributed across very long texts. Task length and fact placement are algorithmically adjustable, so future LLMs cannot have memorized it. Two findings stand out. First, common methods, including GPT-4 and RAG, are effective only for sequences up to ~10⁴ elements, and their performance relies heavily on the first 25% of the input. That is a stark quantification of the lost-in-the-middle problem, where attention effectively ignores the bulk of a long context. Second, fine-tuning a small GPT-2 with recurrent memory augmentation lets it handle up to 11 million tokens, reported as by far the longest input processed by any neural model to date, and crucially enables multi-hop reasoning by filtering irrelevant information rather than attending over everything.
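The "algorithmically adjustable length and placement" idea can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual generator: the function `build_sample`, its parameters (`depth`, `spread`, `seed`), and the sentence-level granularity are all assumptions made for this example.

```python
import random

def build_sample(facts, filler, total_sentences, depth=0.5, spread=0.1, seed=0):
    """Scatter `facts` (order preserved) inside randomly drawn filler sentences.

    Hypothetical helper for illustration only.
    depth  : 0.0 places the facts near the start, 1.0 near the end.
    spread : half-width of the placement window, as a fraction of the length.
    """
    if total_sentences < len(facts):
        raise ValueError("total_sentences must be at least len(facts)")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")

    rng = random.Random(seed)
    centre = int(depth * (total_sentences - 1))
    # The window is always wide enough to hold every fact.
    half = max(len(facts), int(spread * total_sentences))
    lo = max(0, centre - half)
    hi = min(total_sentences, centre + half + 1)

    # Sorted slots keep the facts in their original order.
    slots = sorted(rng.sample(range(lo, hi), len(facts)))
    slot_to_fact = dict(zip(slots, facts))

    sentences = [
        slot_to_fact[i] if i in slot_to_fact else rng.choice(filler)
        for i in range(total_sentences)
    ]
    return sentences, slots

facts = ["Mary picked up the apple.", "Mary went to the kitchen."]
filler = ["The sky was grey.", "A train passed by."]
sentences, slots = build_sample(facts, filler, total_sentences=1000, depth=0.9)
```

Sweeping `total_sentences` and `depth` while holding the facts fixed is what makes it possible to measure how accuracy varies with context length and with where the evidence sits, which is the kind of measurement behind the first-25% finding.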

The keeper is the comparative claim: recurrent memory excels at filtering irrelevant content in a way that scaling attention does not. Where attention degrades and concentrates on the start of the input, a compact recurrent state forces the model to decide what to carry forward — and that selectivity is what unlocks ultra-long multi-hop reasoning.
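A toy illustration of that selectivity follows. It reads the input segment by segment and carries only a small bounded state forward. The keep rule here is hand-written purely for illustration and stands in for what a fine-tuned recurrent memory would learn; `read_with_memory`, `entity_vocab`, and `capacity` are hypothetical names, not anything from the paper.

```python
def read_with_memory(segments, question_entities, entity_vocab, capacity=4):
    """Process segments in order, keeping at most `capacity` sentences.

    Toy stand-in for a learned recurrent memory: a sentence is carried
    forward only if it mentions an entity currently being tracked.
    """
    tracked = {e.lower() for e in question_entities}
    vocab = {e.lower() for e in entity_vocab}
    memory = []  # the only state that survives from one segment to the next

    for segment in segments:
        for sentence in segment:
            words = {w.strip(".,!?").lower() for w in sentence.split()}
            if words & tracked:
                memory.append(sentence)
                memory = memory[-capacity:]  # state size stays bounded
                # Hop: start tracking any other known entity in this fact.
                tracked |= words & vocab
    return memory

segments = [
    ["The sky was grey.", "Mary picked up the apple."],
    ["A train passed by."] * 50,  # irrelevant bulk
    ["Mary went to the kitchen.", "John went to the garden."],
]
memory = read_with_memory(segments, ["apple"], ["Mary", "John", "apple"])
```

The question only names the apple, yet the carried state ends up holding the two facts needed to locate it, because the first fact adds Mary to the tracked set. The irrelevant middle segment never enters the state, however long it is.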

This complements the vault's long-context thread from the memory side. It pairs with "Can neural memory modules scale language models beyond attention limits?" (Titans) as another recurrent-memory route past attention's limits, and it provides the empirical lost-in-the-middle grounding for "How do LLMs balance remembering context versus keeping it separate?". It also sits in productive tension with "Can state-space models match transformers at copying and retrieval?": a fixed recurrent state is worse at verbatim copying yet better at filtering for multi-hop fact extraction, so the task profile decides which wins.

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

What role does compression play in language model capability and generalization?

Does recurrent memory or gist compression work better for ultra-long context?

What memory architectures best support persistent reasoning across extended interactions?

How does sequence length affect sparsity tolerance in models?

What task profiles favor recurrent filtering over scaled attention mechanisms?

Related concepts in this collection 3

Can neural memory modules scale language models beyond attention limits? Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
sibling recurrent-memory route past attention's context limits
Can state-space models match transformers at copying and retrieval? Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.
tension: fixed state loses at copying but wins at filtering for multi-hop extraction; task profile decides
How do LLMs balance remembering context versus keeping it separate? LLMs face a structural tension: retaining too much context causes different threads to blur together, while retaining too little causes the model to lose track of earlier commitments. This explores whether this dilemma is fundamental to how transformers work.
the first-25%-reliance finding is concrete evidence of the long-context degradation that note frames
