Can recurrent memory scale where attention fails on ultra-long text?
GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
BABILong is a leak-proof benchmark for extracting and processing facts distributed across very long texts. Sample length and fact placement are algorithmically adjustable, so future LLMs cannot have memorized it. Two findings stand out. First, common methods, including GPT-4 and RAG, are effective only for sequences up to ~10⁴ elements, and their performance relies heavily on the first 25% of the input. This is a stark quantification of the lost-in-the-middle problem, where attention effectively ignores the bulk of a long context. Second, fine-tuning a small GPT-2 with recurrent memory augmentation lets it handle up to 11 million tokens, which the authors describe as by far the longest input processed by a neural model to date. Crucially, it enables multi-hop reasoning by filtering out irrelevant information rather than attending over everything.
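To make "length and placement are algorithmically adjustable" concrete, here is a minimal sketch of how a BABILong-style sample could be assembled. It is an illustration under stated assumptions, not the benchmark's actual generator: `build_sample`, its parameters, and the filler sentences are hypothetical, and length is counted in whitespace-separated words as a rough stand-in for tokens.

```python
import random


def build_sample(facts, filler, target_len, depths, seed=0):
    """Scatter `facts` into filler text at relative `depths` (0.0 to 1.0).

    Hypothetical helper for illustration only. `target_len` is a word
    count, used here as a rough proxy for a token count.
    """
    if len(facts) != len(depths):
        raise ValueError("need exactly one depth per fact")
    rng = random.Random(seed)

    # Draw filler sentences until the background reaches the target length.
    background = []
    n_words = 0
    while n_words < target_len:
        sentence = rng.choice(filler)
        background.append(sentence)
        n_words += len(sentence.split())

    # Compute every insertion slot against the untouched background, then
    # insert from the deepest slot backwards so earlier slots do not shift.
    n = len(background)
    slots = [(int(d * n), i, f) for i, (d, f) in enumerate(zip(depths, facts))]
    for idx, _, fact in sorted(slots, reverse=True):
        background.insert(idx, fact)
    return " ".join(background)


facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
filler = ["The weather was mild that year.", "Trains ran late again."]
sample = build_sample(facts, filler, target_len=200, depths=[0.1, 0.8])
```

Because the facts and their depths are parameters, the same question can be posed at any context length and with the evidence placed early, in the middle, or late, which is what allows a benchmark of this kind to measure reliance on the first part of the input.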
The claim worth keeping is the comparative one: recurrent memory excels at filtering irrelevant content in a way that scaling attention does not. Where attention degrades and concentrates on the start of the input, a compact recurrent state forces the model to decide what to carry forward, and that selectivity is what unlocks ultra-long multi-hop reasoning.
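The idea that a compact state forces a decision about what to carry forward can be sketched as a toy loop. This is not the paper's model, which learns its memory inside a transformer. Here a hand-written predicate, `is_relevant`, stands in for the learned selection, and `recurrent_read`, `segment_size`, and `memory_slots` are hypothetical names chosen for illustration.

```python
def recurrent_read(sentences, segment_size, memory_slots, is_relevant):
    """Read `sentences` one segment at a time with a bounded memory.

    Toy illustration: `is_relevant` is a hand-written stand-in for what a
    trained model would have to learn.
    """
    if memory_slots < 1:
        raise ValueError("memory_slots must be at least 1")
    memory = []
    for start in range(0, len(sentences), segment_size):
        segment = sentences[start:start + segment_size]
        # Only the current segment and the carried memory are visible here;
        # the cost per step does not grow with the total input length.
        candidates = memory + [s for s in segment if is_relevant(s)]
        # Fixed capacity: keep the most recent items and drop the rest.
        memory = candidates[-memory_slots:]
    return memory


noise = ["The sky was grey."] * 5000
story = (
    noise[:100]
    + ["Mary went to the kitchen."]
    + noise[100:4000]
    + ["Mary picked up the apple."]
    + noise[4000:]
)
kept = recurrent_read(
    story, segment_size=64, memory_slots=4, is_relevant=lambda s: "Mary" in s
)
```

In this toy, both facts survive thousands of filler sentences because everything else is discarded at each step. The same bounded state is also why such a mechanism cannot reproduce the input verbatim.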
This complements the vault's long-context thread from the memory side. It pairs with "Can neural memory modules scale language models beyond attention limits?" (Titans) as another recurrent-memory route past attention's limits, and it supplies the empirical lost-in-the-middle grounding for "How do LLMs balance remembering context versus keeping it separate?" It also sits in productive tension with "Can state-space models match transformers at copying and retrieval?": a fixed recurrent state is worse at verbatim copying yet better at filtering for multi-hop fact extraction, so the task profile decides which wins.
Inquiring lines that use this note as a source 4
This note is a source for these synthesized inquiries.
- Does recurrent memory or gist compression work better for ultra-long context?
- Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?
- What task profiles favor recurrent filtering over scaled attention mechanisms?
- How do recurrent memory systems handle ultra-long context differently than attention?
Related concepts in this collection 3
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  sibling recurrent-memory route past attention's context limits
- Can state-space models match transformers at copying and retrieval?
  Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.
  tension: fixed state loses at copying but wins at filtering for multi-hop extraction; task profile decides
- How do LLMs balance remembering context versus keeping it separate?
  LLMs face a structural tension: retaining too much context causes different threads to blur together, while retaining too little causes the model to lose track of earlier commitments. This explores whether this dilemma is fundamental to how transformers work.
  the first-25%-reliance finding is concrete evidence of the long-context degradation that note frames
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
- Language Models Need Sleep
- Memorization and Knowledge Injection in Gated LLMs
- Titans: Learning to Memorize at Test Time
- Repeat After Me: Transformers are Better than State Space Models at Copying
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- Localizing Paragraph Memorization in Language Models
Original note title
recurrent memory augmentation processes eleven million tokens while LLMs and RAG rely on the first quarter of input — recurrent memory beats attention at filtering ultra-long context