Can lookup memory and computation work together better than either alone?
Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?
Transformers have one sparsity primitive: conditional computation via Mixture-of-Experts, where dynamic logic routes through sparsely activated parameters. Engram (2601.07372) argues this is incomplete. Knowledge has a different shape from logic. Static facts ("Jacobi was born in 1804") are not dynamic logic; they are key-value lookups. Forcing them through computation wastes capacity on simulating retrieval.
Engram introduces conditional memory as the missing sparsity axis. The instantiation is a modernized N-gram embedding table — local context as key, indexed via constant-time O(1) lookups into a massive embedding store. The modernizations matter: tokenizer compression, multi-head hashing, contextualized gating, multi-branch integration. Classical N-grams failed because they could not compose; these adaptations make them composable with the surrounding transformer.
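The lookup path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `HashedNgramMemory`, the function `gated_merge`, the table sizes, the use of BLAKE2 as the hash, and the choice to sum per-head embeddings are all hypothetical. Tokenizer compression and multi-branch integration are omitted, and the tables are random where a real model would learn them.

```python
import hashlib
import math
import random


class HashedNgramMemory:
    """Toy conditional-memory table: the trailing n token ids form the key."""

    def __init__(self, n=2, num_heads=4, slots_per_head=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_heads = num_heads
        self.slots_per_head = slots_per_head
        self.dim = dim
        # One embedding table per hash head (random here, learned in a real model).
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(slots_per_head)]
            for _ in range(num_heads)
        ]

    def _slot(self, ngram, head):
        # Salt the key with the head index so each head collides on different n-grams.
        key = f"{head}:" + ",".join(map(str, ngram))
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.slots_per_head

    def lookup(self, token_ids):
        """Sum per-head embeddings for the trailing n-gram.

        Cost depends on num_heads and dim, not on table size or sequence length.
        """
        ngram = tuple(token_ids[-self.n:])
        out = [0.0] * self.dim
        for head in range(self.num_heads):
            row = self.tables[head][self._slot(ngram, head)]
            out = [a + b for a, b in zip(out, row)]
        return out


def gated_merge(hidden, memory):
    """Add the retrieved vector to the hidden state, scaled by a sigmoid gate.

    The gate is a scaled dot product between hidden state and memory; this is
    an assumed stand-in for contextualized gating, not the paper's formula.
    """
    score = sum(h * m for h, m in zip(hidden, memory)) / math.sqrt(len(hidden))
    gate = 1.0 / (1.0 + math.exp(-score))
    return [h + gate * m for h, m in zip(hidden, memory)]
```

Calling `HashedNgramMemory().lookup([17, 42, 7])` returns one vector keyed only on the last two tokens, and `gated_merge(hidden, vec)` lets the current hidden state decide how much of it to admit. Using several salted hash heads means two N-grams that collide in one table are unlikely to collide in all of them, which is one plausible reading of why multi-head hashing helps.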
The surprising empirical result is a U-shaped scaling law in sparsity allocation. At iso-parameter and iso-FLOPs budgets, pure MoE underperforms hybrid MoE+Engram allocations, and pure Engram also underperforms. There is an optimum: some capacity should go to conditional computation (logic), some to conditional memory (lookup). The curve has a single minimum loss; sliding too far in either direction degrades performance.
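To make the allocation axis concrete, the sketch below enumerates candidate splits of a fixed sparse-parameter budget between MoE experts and memory slots. It covers only the iso-parameter accounting. The function names, the ratio `rho`, and the per-expert and per-slot sizes are hypothetical, and no loss values are modeled: in practice each configuration would have to be trained and evaluated to trace the U-shaped curve.

```python
def split_sparse_budget(total_params, rho, expert_params, embed_dim):
    """Split a fixed sparse-parameter budget between MoE experts and memory slots.

    rho is the fraction of the budget given to conditional computation (MoE);
    whatever the experts do not use goes to the lookup table.
    All argument names are hypothetical.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be in [0, 1]")
    num_experts = round(rho * total_params) // expert_params
    leftover = total_params - num_experts * expert_params
    num_slots = leftover // embed_dim
    return num_experts, num_slots


def allocation_grid(total_params, expert_params, embed_dim, steps=5):
    """Enumerate splits from pure memory (rho = 0) to pure MoE (rho = 1)."""
    grid = []
    for i in range(steps + 1):
        rho = i / steps
        experts, slots = split_sparse_budget(total_params, rho, expert_params, embed_dim)
        grid.append((rho, experts, slots))
    return grid
```

For example, `split_sparse_budget(1000, 0.5, 100, 10)` gives 5 experts and 50 slots. The two endpoints of the grid correspond to the pure-Engram and pure-MoE configurations that the note reports as underperforming the hybrid.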
More surprising: the gains are not confined to knowledge retrieval (MMLU +3.4, CMMLU +4.0). They are comparable or larger in general reasoning (BBH +5.0, the largest single gain; ARC-Challenge +3.7) and also appear in code and math (HumanEval +3.0, MATH +2.4). The mechanistic interpretation: Engram relieves the backbone's early layers of "static reconstruction", the labor of approximating N-gram statistics through attention and MLPs. With that labor offloaded, early layers can be repurposed for deeper composition. Effectively, Engram deepens the network without adding layers, by freeing early-layer parameters from local work.
The long-context implication is striking. By delegating local dependencies to lookups, attention capacity is freed for global context. Multi-Query NIAH retrieval rises from 84.2 to 97.0. This suggests the long-context bottleneck is not pure context length but attention's dual burden: it must simultaneously do local approximation and global integration. Separating those labors helps both.
The architectural framing — sparsity has multiple axes, computation and memory are complementary — sets up "memory primitives" as first-class design objects for next-generation sparse models. Most prior memory-augmented work treated external memory as a workaround for parametric limits; Engram positions conditional memory as a co-equal primitive.
Inquiring lines that use this note as a source (30)
This note is a source for these synthesized inquiries.
- How do MIPS algorithms constrain the choice of similarity functions?
- How do the six memory components combine across explicit and implicit paths?
- How should we allocate compute between reasoning and retrieval iterations?
- Can model routing and compute allocation work together as independent optimizations?
- How does inference compute substitution affect the training parameter scaling trade-off?
- What decomposition level minimizes both error rate and computational cost in practice?
- Can neural networks implement genuine algorithms or only statistical pattern matching?
- Which RAG sub-decisions are actually pattern matching versus reasoning intensive?
- How do conditional scaling laws incorporate hardware into architecture choices?
- How do cortical columns implement local inference over memory cycles?
- Do models excel at reasoning depth or memory breadth when scaling test time compute?
- Why do embedding table lookups become memory-bound bottlenecks at scale?
- Where does inference compute stop substituting for model capacity?
- Can compute allocation and model routing be combined for better results?
- Can steering vectors be combined with other compression techniques?
- How do sparse circuits compare to the modular subnetworks that emerge naturally?
- What makes sparse models inefficient to train and deploy at scale?
- How does context budget create tradeoffs between memory and skills?
- How can memory shift from a passive datastore to an actively trained component?
- Does conditional memory reduce computation alongside conditional sparsity?
- Can memory primitives become first-class design objects like computation sparsity?
- Why do hybrid memory systems outperform single-tier AI architectures?
- Which attention heads are essential for maintaining factuality in sparse models?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- How does mixture of experts enable flexible capacity sharing between modalities?
- Can sparse attention methods be designed specifically for multi-hop reasoning tasks?
- How do sparse mixture-of-experts models resolve modality capacity competition?
- Does ternary weight quantization simplify deployment of mixture of experts?
- What makes mixture-of-experts routing learn token-level specialization effectively?
- Why does Branch-Train-Merge fail without learned routing between experts?
Related concepts in this collection (3)
- Can retrieval knowledge compress into a tiny parametric model?
  Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.
  Memory Decoder compresses non-parametric retrieval into a parametric module; Engram goes in the inverse direction, adding a lookup primitive to parametric models.
- Can neural memory modules scale language models beyond attention limits?
  Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
  Titans/Miras add neural memory as an architectural component; Engram is the static-lookup analog, complementing rather than competing.
- Can recursive subtask trees overcome context window limits?
  Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
  TIMRUN frees attention from the long-history burden via pruning; Engram frees attention from the local-statistics burden via lookup; both reframe attention's job.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
- Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models
Original note title
conditional memory is a complementary sparsity axis to conditional computation — hybrid lookup plus MoE beats pure MoE at iso-parameter and iso-FLOPs