Can neural memory modules scale language models beyond attention limits?
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans (2501.00663) introduces a neural long-term memory module that addresses a fundamental contradiction in linear recurrent models: they are designed for efficiency on long contexts, but long contexts cannot be properly compressed into small fixed-size states.
The architectural insight is that attention and memory serve fundamentally different functions. Attention operates as short-term memory: it models direct dependencies accurately within the current context window, but its quadratic cost limits its reach. Neural memory operates as long-term memory: compressed and persistent, it memorizes data that is surprising or that sits close to surprising tokens. The memory update mechanism weighs the amount of memory already in use against the surprise of the incoming data, which gives adaptive memory management: what is kept and what is forgotten depends on the data.
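A minimal sketch of a surprise-driven memory write, under stated assumptions: the memory is a single linear map `M`, surprise is taken to be the gradient of a squared-error recall loss, and `lr`, `momentum`, and `forget` are fixed illustrative constants (in the description above, forgetting is adaptive). The names `surprise` and `memory_step` are hypothetical and are not the paper's API.

```python
import numpy as np

def surprise(M, k, v):
    """Gradient of 0.5 * ||M k - v||^2 with respect to M.

    A large gradient means the memory recalls v poorly from key k,
    i.e. the token is surprising. (Assumed loss, for illustration.)
    """
    return np.outer(M @ k - v, k)

def memory_step(M, S, k, v, lr=0.1, momentum=0.9, forget=0.01):
    """One write into the memory: momentum on surprise, then decay + update."""
    grad = surprise(M, k, v)
    S = momentum * S - lr * grad   # carry past surprise forward, add the new one
    M = (1.0 - forget) * M + S     # decay old content, then write
    return M, S

# A key/value pair the memory sees repeatedly, and one it never sees.
k_seen, v_seen = np.array([1.0, 0.0, 0.0]), np.array([1.0, 2.0])
k_new, v_new = np.array([0.0, 1.0, 0.0]), np.array([-1.0, 0.5])

M = np.zeros((2, 3))
S = np.zeros_like(M)
for _ in range(200):
    M, S = memory_step(M, S, k_seen, v_seen)

seen_surprise = np.linalg.norm(surprise(M, k_seen, v_seen))
new_surprise = np.linalg.norm(surprise(M, k_new, v_new))
print(seen_surprise < new_surprise)
```

After repeated exposure the seen pair produces only a small gradient, while the unseen pair still produces a large one, so a write driven by this signal concentrates on the surprising token. The small residual error on the seen pair comes from the decay term.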
Three integration variants are proposed: memory as context (attending to memory alongside current context), memory as gating (memory modulates attention output), and memory as a layer (memory replaces some attention layers). Each variant trades off between integration depth and computational overhead.
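The three variants can be sketched as follows. This is a shape-level illustration only, assuming a linear memory read (`read_memory`), single-head attention without causal masking, and a sigmoid gate; all function names are hypothetical, and the paper's actual blocks contain further components that are omitted here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Plain scaled dot-product attention (no masking, single head)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def read_memory(M, x):
    """Assumed stand-in for a memory read: map each row of x through M."""
    return x @ M.T

def memory_as_context(x, M):
    # Retrieved memory is placed alongside the current segment and attended to.
    mem = read_memory(M, x)
    ctx = np.concatenate([mem, x], axis=0)
    return attention(x, ctx, ctx)

def memory_as_gate(x, M):
    # Memory output modulates the attention output elementwise.
    gate = 1.0 / (1.0 + np.exp(-read_memory(M, x)))
    return gate * attention(x, x, x)

def memory_as_layer(x, M):
    # The memory read stands in for an attention layer.
    return read_memory(M, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))     # 5 tokens, model width 8
M = rng.normal(size=(8, 8))
outputs = [f(x, M) for f in (memory_as_context, memory_as_gate, memory_as_layer)]
```

All three return one vector per input token. They differ in cost: the context variant attends over twice as many positions, the gating variant runs attention and a memory read side by side, and the layer variant skips attention entirely in that layer.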
The results establish that Titans outperform both standard Transformers (with the same context window) and modern linear recurrent models across language modeling, common-sense reasoning, genomics, and time series. Critically, Titans scale to context windows larger than 2M tokens while showing competitive performance with Transformers that use the entire context, so the long-context problem is addressed without the quadratic penalty. The persistent nature of the memory module also makes it a natural substrate for sleep-time compute (see "Can models precompute answers before users ask questions?"): the memory can store precomputed inferences between interactions, and sleep-time processing can populate it with anticipated query-relevant information.
Relative to latent reasoning (see "Can models reason without generating visible thinking tokens?"), the Titans architecture offers a complementary path: rather than scaling reasoning depth through recurrent computation, it scales memory breadth through adaptive memorization. Both bypass the limitations of standard attention, but along different architectural dimensions.
Miras unifying framework (2504.13173): The "It's All Connected" paper reconceptualizes Transformers, Titans, and modern linear recurrent models as associative memory modules that learn a mapping of keys to values using an internal objective — termed "attentional bias." The paper observes that most existing sequence models use either dot-product similarity or ℓ2 regression objectives as their attentional bias. Miras provides a general framework with four design choices: (i) associative memory architecture, (ii) attentional bias objective, (iii) retention gate, and (iv) memory learning algorithm. Forgetting mechanisms are reinterpreted as retention regularization — providing a principled basis for forget gates across architectures. Three novel sequence models — Moneta, Yaad, and Memora — go beyond existing linear RNNs while maintaining fast parallelizable training. Different Miras configurations yield models with varying strengths: some excel at language modeling, others at commonsense reasoning or recall-intensive tasks. This generalizes the Titans insight: the attention-as-short-term/memory-as-long-term distinction is one instance of a broader design space where attentional bias objective and retention mechanism can be independently varied.
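To make the four design choices concrete, here is a minimal sketch in which each choice is a separate, swappable piece. It assumes a linear matrix memory, a plain gradient step as the learning algorithm, and scalar decay as the retention gate. The Huber objective is included only as an illustrative alternative to ℓ2; it is not claimed to match any of the paper's named models, and all function names are hypothetical.

```python
import numpy as np

# (ii) Attentional bias: the internal objective, given here by its gradient w.r.t. M.
def l2_bias_grad(M, k, v):
    """Gradient of 0.5 * ||M k - v||^2."""
    return np.outer(M @ k - v, k)

def huber_bias_grad(M, k, v, delta=1.0):
    """Gradient of an elementwise Huber loss on (M k - v).

    The Huber derivative equals the error clipped to [-delta, delta],
    so a single outlier cannot dominate the write.
    """
    err = M @ k - v
    return np.outer(np.clip(err, -delta, delta), k)

# (iii) Retention gate: how much of the old memory survives each step.
def scalar_retention(M, alpha=0.05):
    return (1.0 - alpha) * M

# (iv) Memory learning algorithm: one gradient-descent step.
def gd_step(M, k, v, bias_grad, retention, lr=0.5):
    return retention(M) - lr * bias_grad(M, k, v)

# (i) Memory architecture: a 2x2 linear map, initially empty.
M0 = np.zeros((2, 2))
k = np.array([1.0, 0.0])
v_outlier = np.array([100.0, 0.0])   # an extreme value to memorize

M_l2 = gd_step(M0, k, v_outlier, l2_bias_grad, scalar_retention)
M_huber = gd_step(M0, k, v_outlier, huber_bias_grad, scalar_retention)
print(np.abs(M_l2).max() > np.abs(M_huber).max())
```

Swapping only the objective changes how strongly one extreme token is written into memory, while the retention gate and update rule stay untouched. That independence between components is the point of the framework's decomposition.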
Inquiring lines that use this note as a source (121)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How do users perceive attention from systems that lack continuous temporal presence?
- Can AI ever lead conversations without the anticipatory presence sustained attention provides?
- How does the temporal structure of attention differ between humans and AI?
- Can better attention mechanisms close the gap between human and AI frame-activation?
- How does context collapse affect what language models can meaningfully communicate?
- Can context compression preserve what matters without introducing bias?
- Do retrieval-augmented memory systems actually solve the compartmentalization problem?
- Does transformer attention architecture fundamentally prevent topic-aware memory?
- Can continuum memory systems prevent catastrophic forgetting in neural networks?
- How should memory consolidation timing differ across multiple timescales?
- Why do large language models follow user drift instead of maintaining topic focus?
- Can autoregressive models be trained to produce more cataphoric text?
- What makes multimodal conditioning effective when features are decomposed to the right granularity?
- Why does attention-based drift happen automatically during generation?
- How much does memorization capacity limit a model's ability to learn new information?
- Can transformer attention patterns actually prevent topic context loss in practice?
- What does attentional state look like in a static context window?
- How does circuit complexity limit which grammatical structures transformers can acquire?
- How do attention heads separate text retrieval from internal thought representation?
- How does AI's inability to sustain temporal attention limit its capacity for expert roles?
- Can long-context readers handle compositional tasks or just semantic search?
- Can gradient approximation at equilibrium replace backpropagation through time in practice?
- Can targeted interventions on attention heads bridge the encoding-generation gap?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- How does dual-rate learning separate episodic and procedural memory in neural networks?
- Can fast-slow separation improve both memory and generation in language models?
- Why do transformer attention patterns show positional and sequential bias across tasks?
- Do decoder-only models have inherent architectural limits for non-sequential information?
- Can data pruning strategies exploit the finite nature of memorization capacity?
- How do neural memory modules extend context length beyond attention limits?
- Does the prediction unit shape what language models actually learn?
- Can next-token prediction train models to optimize for communication efficiency?
- What attentional bias objectives compete with dot product similarity for associative memory?
- How do retention gates regularize forgetting across different sequence model architectures?
- How do retrieved memories differ from decision-context passages for prediction?
- Do attention scores predict which tokens will be pruned first?
- How does completion-driven KV pruning differ from attention-based cache management?
- Why does attention quality degrade as context length increases?
- How do model priors enable targeted context queries without full attention?
- Why does recency-based recall outperform semantic similarity for episodic memory?
- What persistent memory architectures best support storing precomputed inferences across sessions?
- How should tiny language models be architected differently than large ones?
- What neural or architectural mechanism allows selective override of frequency effects?
- Why does transformer attention architecture undermine stickiness in model behavior?
- What makes multi-session context tracking harder than single-turn underspecification problems?
- Does attention bias explain grounding failure in language models?
- What makes sparse models inefficient to train and deploy at scale?
- Why do cross-product features memorize better than dense embeddings?
- Can the joint-training principle extend beyond memorization and generalization pairs?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Why do language models fail at iterative numerical optimization despite scale?
- What computational costs does closed-loop memory refinement introduce?
- How does context budget create tradeoffs between memory and skills?
- How can memory shift from a passive datastore to an actively trained component?
- Can autoencoders act as associative memory systems like Hopfield networks?
- What other internal model decisions beyond attention could be optimized directly?
- Why does LLM memory consolidation regress below no-memory baselines?
- Can neural modules memorize surprising tokens as adaptive long-term memory?
- Does conditional memory reduce computation alongside conditional sparsity?
- How do memorization and attention map onto different memory systems?
- How does separating local and global context dependencies affect long-context performance?
- Can memory primitives become first-class design objects like computation sparsity?
- Can episodic raw memory outperform consolidated summaries in practice?
- Why do longer sequences tolerate higher sparsity than shorter ones?
- What mechanisms cause short contexts to degrade more under aggressive sparsity?
- Why do hybrid memory systems outperform single-tier AI architectures?
- What mechanism transfers explicit memories into parametric model weights?
- Can offline recurrent passes replicate sleep-based memory consolidation in AI?
- How does the hippocampus bind disparate elements without storing everything itself?
- Which attention heads are essential for maintaining factuality in sparse models?
- Do pretrained language models carry reusable computational scaffolding for length handling?
- Can zero-weight drift through external memory replace parameter plasticity entirely?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- Can bounded workspaces prevent overthinking better than summarization alone?
- Can test-time scaling compound through memory consolidation into a new scaling law?
- Can memory-based adaptation and gradient fine-tuning operate on complementary timescales?
- What are the scaling law differences between vision and language learning?
- Why does semantic memory abstraction outperform raw episodic recall for personalization?
- Can language models execute iterative numerical methods in latent space?
- How do complementary learning systems explain the need for fast and slow consolidation?
- Can sleep-time compute reduce latency demands during model inference?
- Why is consolidation quality the binding constraint in neural memory systems?
- Can sparse attention methods be designed specifically for multi-hop reasoning tasks?
- How should benchmark design account for task-dependent sparsity tolerance differences?
- What limits the capacity of context-based fast adaptation channels?
- How does transformer attention bias toward repeated and context-prominent content?
- Can models consolidate context into weights during idle offline phases?
- Do long-term memory modules outperform consolidation into fast weights?
- How can a forgetting policy preserve rare knowledge while preventing over-generalization?
- Does including full context always degrade memory retrieval quality in practice?
- Can latent recurrence overcome the trainability costs of depth?
- Do transformer architectures structurally bias models toward short-term optimization?
- How do sleep-time and post-completion methods reduce inference latency?
- Why do language models ignore condensed memory even when it is the only memory?
- Can RL directly optimize attention distributions instead of text generation?
- Do KANs maintain their advantages in deep architectures and large-scale training?
- How should memory systems split between short-term and long-term storage?
- How does attention sink behavior relate to internal model architecture?
- Does recurrent memory or gist compression work better for ultra-long context?
- How do memory hierarchies and compression reduce context management demands?
- How much does sliding-window augmentation improve single-session modeling?
- Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?
- How do adaptive memory modules compare to feedback-based working memory for long context?
- What makes looped latent computation more efficient than scaling attention capacity?
- Why does attending to own latents work better than bolted-on external memory stores?
- What is the theoretical capacity limit before memorization saturates?
- How does disentangled attention separate text from spatial reasoning?
- How should agents compress episodic interactions into working memory without accumulation?
- Why does attention concentrate on the first 25% of long input sequences?
- How do fixed recurrent states trade off copying accuracy for filtering ability?
- Can adaptive memory modules combine long-term filtering with short-term attention benefits?
- What task profiles favor recurrent filtering over scaled attention mechanisms?
- How does externalized state affect the long-context bottleneck in language models?
- Does attention linearity alone explain the efficiency gains over standard transformers?
- Can spiking sparsity replace weight quantization as a primary efficiency lever?
- How does reducing activation precision further extend context length?
- Can attention linearity achieve similar efficiency gains as weight quantization?
- Why do hybrid attention architectures outperform pure linear attention models?
- How do recurrent memory systems handle ultra-long context differently than attention?
- What are the concrete efficiency gains of linear-attention state-space models?
- Can fixed-size latent states losslessly store arbitrary input context?
Related concepts in this collection (6)
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  complementary architectural innovation: depth-recurrent reasoning vs. breadth-adaptive memory
- Can models reason without generating visible thinking steps?
  Do machine reasoning systems actually require verbalized chains of thought, or can they solve complex problems through hidden computation? This challenges how we measure and understand reasoning.
  alternative to verbalized reasoning
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  Titans' persistent neural memory is a natural substrate for sleep-time compute: the memory module can store precomputed inferences between interactions, and sleep-time processing can update the memory with anticipated query patterns; both exploit statefulness to reduce per-query cost
- Can latent thought vectors scale language models beyond parameters?
  Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.
  LTMs implement fast-slow dynamics at the generation level (dual-rate learning of thought vs token vectors) complementing Titans' fast-slow at the memory level (attention as short-term, neural memory as long-term)
- Can lookup memory and computation work together better than either alone?
  Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
  Engram adds static O(1) N-gram lookup as a third memory primitive complementing Titans' neural memory; together they suggest memory architecture has multiple co-equal axes (neural-adaptive, static-lookup, attention) rather than a single hierarchy
- Has memory architecture replaced parameter count as the scaling frontier?
  Late-2025 research suggests the field's next major efficiency gains come from restructuring how models store and use experience rather than simply making them larger. Three convergent signals point to this shift.
  Titans/Miras is one of three convergent signals (alongside ReasoningBank MaTTS and Engram U-curve) that memory-architecture restructuring has displaced parameter count as the active scaling axis
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Titans: Learning to Memorize at Test Time
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
- Language Models Need Sleep
- Repeat After Me: Transformers are Better than State Space Models at Copying
- Memorization and Knowledge Injection in Gated LLMs
- Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- Jamba: A Hybrid Transformer-Mamba Language Model
Original note title
neural memory modules that adaptively memorize surprising tokens complement attention as long-term vs short-term memory — scaling to 2M+ context