SYNTHESIS NOTE

Can lookup memory and computation work together better than either alone?

Mixture-of-Experts handles dynamic logic, but static knowledge might need a different mechanism. Can a hybrid approach combining conditional computation with fast lookup outperform pure sparse models?

Synthesis note · 2026-05-18 · sourced from Memory

Sparse transformers currently rely on one sparsity primitive: conditional computation via Mixture-of-Experts (MoE), which routes each token through a small subset of activated parameters. Engram (2601.07372) argues this is incomplete, because knowledge has a different shape from logic. Static facts ("Jacobi was born in 1804") are key-value lookups, not dynamic logic. Forcing them through computation wastes capacity on simulating retrieval.

Engram introduces conditional memory as the missing sparsity axis. The instantiation is a modernized N-gram embedding table: the local context (the last few tokens) serves as the key, which is indexed into a massive embedding store in O(1) time. The modernizations matter: tokenizer compression, multi-head hashing, contextualized gating, and multi-branch integration. Classical N-grams failed because they could not compose; these adaptations make them composable with the surrounding transformer.
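
The lookup path described above can be sketched in a few lines of plain Python. This is a toy illustration, not Engram's implementation: the hash function, the averaging of heads, the sigmoid gate computed from a dot product with the hidden state, and every name (`ngram_hash`, `NGramMemory`, `lookup`, `forward`) are assumptions made for the example. Tokenizer compression and multi-branch integration are omitted.

```python
import math
import random

MASK64 = (1 << 64) - 1


def ngram_hash(ngram, head, table_size):
    """Hash a tuple of token IDs to a table row; each head uses a different seed."""
    h = (head + 1) * 0x9E3779B1
    for tok in ngram:
        h = ((h * 1000003) ^ tok) & MASK64
    return h % table_size


class NGramMemory:
    """Toy conditional memory: hashed N-gram keys select embedding rows."""

    def __init__(self, n=2, num_heads=4, table_size=1024, dim=8, seed=0):
        rng = random.Random(seed)
        self.n, self.num_heads = n, num_heads
        self.table_size, self.dim = table_size, dim
        # One table per hash head (random here; learned in a real model).
        self.tables = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(table_size)]
            for _ in range(num_heads)
        ]

    def lookup(self, token_ids, pos):
        """O(1) retrieval: hash the last n tokens ending at pos, once per head."""
        ngram = tuple(token_ids[max(0, pos - self.n + 1): pos + 1])
        rows = [
            self.tables[h][ngram_hash(ngram, h, self.table_size)]
            for h in range(self.num_heads)
        ]
        # Assumption: heads are averaged; a real design might concatenate them.
        return [sum(col) / self.num_heads for col in zip(*rows)]

    def forward(self, hidden, token_ids):
        """Add the retrieved vector to each hidden state, scaled by a gate."""
        out = []
        for pos, h_vec in enumerate(hidden):
            mem = self.lookup(token_ids, pos)
            # Assumption: the gate depends on the current hidden state, so the
            # same N-gram can be trusted more or less depending on context.
            score = sum(a * b for a, b in zip(h_vec, mem))
            gate = 1.0 / (1.0 + math.exp(-score))
            out.append([a + gate * b for a, b in zip(h_vec, mem)])
        return out


memory = NGramMemory(n=2, num_heads=2, table_size=64, dim=4, seed=1)
token_ids = [5, 9, 5, 9]
hidden_states = [[0.1] * 4 for _ in token_ids]
fused = memory.forward(hidden_states, token_ids)
```

Using several hash heads means that two N-grams colliding in one table are unlikely to collide in all of them, and the gate lets the backbone down-weight a retrieved row when it does not fit the context. Retrieval cost depends on the number of heads, not on the table size.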

The surprising empirical result is a U-shaped scaling law in sparsity allocation. At iso-parameter and iso-FLOPs budgets, pure MoE underperforms hybrid MoE+Engram allocations, and pure Engram underperforms as well. There is an optimum in between: some capacity should go to conditional computation (logic) and some to conditional memory (lookup). Loss reaches a single minimum at an intermediate split, and sliding too far in either direction degrades performance.
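
To make the allocation question concrete, the sketch below does only the bookkeeping for an iso-parameter sweep: it divides a fixed sparse-parameter budget between MoE experts and memory-table rows at several split points. It does not model loss, and it does not reproduce the paper's setup. The function names and all parameters (`memory_fraction`, `params_per_expert`, `embed_dim`) are hypothetical, and the numbers in the usage line are arbitrary.

```python
def split_sparse_budget(total_sparse_params, memory_fraction, params_per_expert, embed_dim):
    """Divide a fixed sparse-parameter budget between experts and table rows.

    memory_fraction = 0.0 is pure MoE; 1.0 is pure lookup memory.
    """
    if not 0.0 <= memory_fraction <= 1.0:
        raise ValueError("memory_fraction must lie in [0, 1]")
    memory_params = int(total_sparse_params * memory_fraction)
    compute_params = total_sparse_params - memory_params
    return {
        "compute_params": compute_params,
        "memory_params": memory_params,
        # Whole experts and whole embedding rows that fit in each share.
        "num_experts": compute_params // params_per_expert,
        "table_rows": memory_params // embed_dim,
    }


def sweep(total_sparse_params, params_per_expert, embed_dim, steps=5):
    """Enumerate splits from pure MoE to pure memory at a constant total."""
    return [
        (i / steps, split_sparse_budget(total_sparse_params, i / steps,
                                        params_per_expert, embed_dim))
        for i in range(steps + 1)
    ]


# Arbitrary illustrative numbers; each configuration would be trained and evaluated.
configs = sweep(total_sparse_params=1_000_000, params_per_expert=50_000, embed_dim=100)
```

In a sweep like this the total parameter count stays constant by construction. Holding FLOPs constant as well would plausibly require keeping the number of activated experts per token fixed, since a table lookup adds little arithmetic; that is an assumption of this sketch rather than a detail taken from the source.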

More surprising, the gains are not confined to knowledge retrieval (MMLU +3.4, CMMLU +4.0): they extend to general reasoning (BBH +5.0, the largest single gain; ARC-Challenge +3.7) and code/math (HumanEval +3.0, MATH +2.4). The mechanistic interpretation is that Engram relieves the backbone's early layers of "static reconstruction", the labor of approximating N-gram statistics through attention and MLPs. With that labor offloaded, early layers can be repurposed for deeper composition. In effect, Engram deepens the network without adding layers, because early-layer parameters no longer have to spend capacity on local work.

The long-context implication is striking. Delegating local dependencies to lookups frees attention capacity for global context: Multi-Query NIAH (needle-in-a-haystack) retrieval rises from 84.2 to 97.0. This suggests the long-context bottleneck is not context length alone but attention's dual burden: it must simultaneously do local approximation and global integration. Separating those labors helps both.

The architectural framing — sparsity has multiple axes, computation and memory are complementary — sets up "memory primitives" as first-class design objects for next-generation sparse models. Most prior memory-augmented work treated external memory as a workaround for parametric limits; Engram positions conditional memory as a co-equal primitive.

Inquiring lines that read this note 32

This note is a source for the following research framings.

Why do semantic similarity and task relevance diverge in vector embeddings?

What memory architectures best support persistent reasoning across extended interactions?

How should inference compute be adaptively allocated based on prompt difficulty?

How should we allocate compute between reasoning and retrieval iterations?

Can model routing outperform monolithic scaling as an efficiency strategy?

Can inference-time compute substitute for scaling up model parameters?

How does example difficulty affect learning efficiency in language models?

What decomposition level minimizes both error rate and computational cost in practice?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Can neural networks implement genuine algorithms or only statistical pattern matching?

How do knowledge injection methods compare across cost and effectiveness?

Which RAG sub-decisions are actually pattern matching versus reasoning intensive?

Do autonomous architecture discoveries follow predictable scaling laws?

How do conditional scaling laws incorporate hardware into architecture choices?

What role does compression play in language model capability and generalization?

What limits mechanistic interpretability's ability to characterize models?

How do sparse circuits compare to the modular subnetworks that emerge naturally?

How does sequence length affect sparsity tolerance in models?

How do transformer attention mechanisms implement memory and algorithmic functions?

Which attention heads are essential for maintaining factuality in sparse models?

What articulatory information do speech signals carry that text cannot?

Related concepts in this collection 3

Can retrieval knowledge compress into a tiny parametric model?
Can the information stored in large non-parametric retrieval datastores be compressed into a small trainable module? This matters because it could combine retrieval's knowledge benefits with the speed of pure parametric methods.
Memory Decoder compresses non-parametric retrieval into a parametric module; Engram runs in the inverse direction, adding a lookup primitive to parametric models.

Can neural memory modules scale language models beyond attention limits?
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans/Miras add neural memory as an architectural component; Engram is the static-lookup analog, complementing them rather than competing.

Can recursive subtask trees overcome context window limits?
Explores whether modeling reasoning as prunable trees of subtasks could eliminate the context length constraints that currently force developers into multi-agent architectures. Asks if working memory can become truly unlimited through selective KV cache retention.
TIMRUN frees attention from the long-history burden via pruning; Engram frees attention from the local-statistics burden via lookup. Both reframe attention's job.

Original note title

conditional memory is a complementary sparsity axis to conditional computation — hybrid lookup plus MoE beats pure MoE at iso-parameter and iso-FLOPs
