Does conditional memory reduce computation alongside conditional sparsity?
This explores whether adding a learned lookup memory — a 'conditional memory' axis — cuts computation the way conditional sparsity (only activating some experts/parameters) already does, and whether the two combine.
Put differently, the question is whether memory and sparse computation are two separate levers for getting more out of the same compute budget, and whether pulling both at once beats pulling either alone. The corpus's most direct answer is yes, and it is surprisingly specific: the Engram work treats conditional memory as a *complementary axis* to conditional computation rather than a substitute for it (see "Can lookup memory and computation work together better than either alone?"). Adding an O(1) N-gram lookup to Mixture-of-Experts routing produces a U-shaped scaling law: split the budget between cheap lookup and routed computation, and the balanced model beats pure MoE at equal parameters and FLOPs. The intuition is that lookup handles what can simply be recalled, freeing the expensive routed compute for what actually needs reasoning. Tellingly, the gains were largest in reasoning and code rather than raw retrieval, exactly where you would want to reserve computation for thinking rather than remembering.
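The lookup-plus-routing split described above can be illustrated with a toy layer. This is a minimal sketch under stated assumptions, not Engram's actual implementation: the names (`ngram_slot`, `route_top_k`, `memory_plus_moe`), the table size, the bigram hash, and the "experts" (simple per-dimension scalings) are all hypothetical and chosen only to show a constant-time recall path sitting next to a sparsely routed compute path.

```python
import random
from typing import List, Sequence

# All sizes below are illustrative assumptions, not values from the source.
DIM = 4            # toy hidden width
TABLE_SIZE = 64    # slots in the hashed lookup table
NUM_EXPERTS = 4    # experts in the toy MoE layer
TOP_K = 1          # experts activated per token

_rng = random.Random(0)


def _rand_vec(n: int) -> List[float]:
    return [_rng.uniform(-1.0, 1.0) for _ in range(n)]


# Lookup memory: one vector per hash slot.
memory_table = [_rand_vec(DIM) for _ in range(TABLE_SIZE)]
# Each toy expert is a per-dimension scale; the router is a linear scorer.
expert_scales = [_rand_vec(DIM) for _ in range(NUM_EXPERTS)]
router_weights = [_rand_vec(DIM) for _ in range(NUM_EXPERTS)]


def ngram_slot(tokens: Sequence[int], pos: int, n: int = 2) -> int:
    """Hash the n-gram ending at `pos` to a table slot (deterministic)."""
    start = max(0, pos - n + 1)
    slot = 0
    for tok in tokens[start:pos + 1]:
        slot = (slot * 1000003 + tok) % TABLE_SIZE
    return slot


def route_top_k(x: Sequence[float], k: int = TOP_K) -> List[int]:
    """Return the indices of the k experts with the highest router score."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def memory_plus_moe(tokens: Sequence[int],
                    hidden: Sequence[Sequence[float]]) -> List[List[float]]:
    """Residual + recalled vector (cheap path) + routed expert output."""
    outputs = []
    for pos, x in enumerate(hidden):
        recalled = memory_table[ngram_slot(tokens, pos)]  # one hash, one read
        chosen = route_top_k(x)                           # sparse compute path
        computed = [0.0] * DIM
        for e in chosen:
            for i in range(DIM):
                computed[i] += expert_scales[e][i] * x[i] / len(chosen)
        outputs.append([x[i] + recalled[i] + computed[i] for i in range(DIM)])
    return outputs


tokens = [5, 9, 5, 9]
hidden = [_rand_vec(DIM) for _ in tokens]
outputs = memory_plus_moe(tokens, hidden)
```

In this sketch the memory path costs one hash and one table read per token regardless of `TABLE_SIZE`, while the routed path does work only for the `TOP_K` chosen experts. Growing the table adds parameters without adding per-token compute, which is the sense in which lookup and routing are separate budget axes here.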
To see why this works, it helps to notice that 'conditional sparsity' isn't one thing. The corpus shows sparsity that the model *learns* on its own: networks grow dense activations for familiar data and stay sparse for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"), and they actively sparsify their hidden states under hard, out-of-distribution tasks as a kind of selective filter (see "Do language models sparsify their activations under difficult tasks?"). So computation already routes itself conditionally based on familiarity. A memory module fits naturally alongside this: it absorbs the familiar so the conditional-compute machinery can concentrate where activations are sparse and the work is genuinely novel.
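To make 'sparse' versus 'dense' activations concrete, one simple way to quantify sparsity is the fraction of near-zero entries in a hidden-state vector. This is an illustrative sketch only: the function name `activation_sparsity` and the threshold `eps` are assumptions, and the cited notes may measure sparsity differently.

```python
from typing import Sequence


def activation_sparsity(hidden: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of entries whose magnitude is at most `eps` (assumed cutoff)."""
    if not hidden:
        raise ValueError("hidden must be non-empty")
    near_zero = sum(1 for h in hidden if abs(h) <= eps)
    return near_zero / len(hidden)


# Hypothetical hidden states: most units active vs. most units silent.
dense_state = [0.8, -0.4, 0.3, 0.9]
sparse_state = [0.0, 0.0, 0.0, 0.7]

dense_score = activation_sparsity(dense_state)
sparse_score = activation_sparsity(sparse_state)
```

Under this measure a higher score means a sparser state, so the second vector scores higher than the first.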
Memory architectures echo the same split. Titans separates short-term attention (quadratic, expensive) from a long-term neural memory that adaptively stores only *surprising* tokens, letting context scale past two million tokens without the quadratic penalty (see "Can neural memory modules scale language models beyond attention limits?"). That 'memorize the surprising, don't recompute the routine' move is the memory-side mirror of conditional computation. There is also a deeper reframing worth knowing: at least one line of work argues the long-context bottleneck was never really memory capacity but the *compute* needed to fold evicted context into the model's internal state (see "Is long-context bottleneck really about memory or compute?"). If that's right, then memory and computation aren't even cleanly separable: better memory is partly a way of pre-paying compute.
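The 'memorize the surprising' idea can be sketched with a deliberately simple proxy. This is not Titans' mechanism: here 'surprise' is just the absolute error of a running scalar predictor, and the function name `surprise_gated_store`, the `threshold`, and the learning rate `lr` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def surprise_gated_store(values: Sequence[float],
                         threshold: float = 1.0,
                         lr: float = 0.9) -> List[Tuple[int, float]]:
    """Store only inputs that a running predictor failed to anticipate."""
    prediction = 0.0
    stored: List[Tuple[int, float]] = []
    for idx, value in enumerate(values):
        surprise = abs(value - prediction)       # proxy for surprise
        if surprise > threshold:
            stored.append((idx, value))          # write to long-term memory
        prediction += lr * (value - prediction)  # predictor adapts either way
    return stored


# A routine stretch, a jump, another routine stretch, then a jump back.
stream = [0.0, 0.1, 0.0, 5.0, 5.1, 5.0, 0.0]
stored = surprise_gated_store(stream)
stored_positions = [idx for idx, _ in stored]
```

Only the two jumps in `stream` exceed the threshold, so the memory holds two entries rather than seven. The routine values are handled by the predictor and never written, which is the storage-side analogue of skipping computation for familiar inputs.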
The lateral lesson the corpus delivers is that sparsity itself is rarely a pure trade-off. Sparse-attention models expand the cost-performance frontier rather than sliding along it: at equal compute, the larger sparse model beats the smaller dense one (see "Does sparse attention trade off quality for speed?"). Conditional memory looks like another Pareto-improving axis of the same kind. But the corpus also flags the failure mode: a single model that tries to *generate* compressed memory instead of looking it up follows a fragile inverted-U, eventually degrading below having no memory at all (see "Can a single model replace retrieval for long-term conversation memory?"). The savings come from cheap, conditional *recall*, not from spending more computation to manufacture memory. That distinction is the whole game: memory reduces computation when it offloads work, and starts costing you the moment it becomes work.
Sources (7 notes)
Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.