INQUIRING LINE

Can a cheap memory lookup and selective compute combine to stretch your AI budget further than either does alone?

Does conditional memory reduce computation alongside conditional sparsity?

This explores whether adding a learned lookup memory — a 'conditional memory' axis — cuts computation the way conditional sparsity (only activating some experts/parameters) already does, and whether the two combine.

This line asks whether memory and sparse computation are two separate levers for getting more out of the same compute budget, and whether pulling both at once beats pulling either alone. The corpus's most direct answer is yes, and it is surprisingly specific: the Engram work treats conditional memory as a *complementary axis* to conditional computation rather than a substitute for it (see "Can lookup memory and computation work together better than either alone?"). Bolting an O(1) N-gram lookup onto Mixture-of-Experts routing produces a U-shaped scaling law: balance your budget between cheap lookup and routed computation and you beat pure MoE at equal parameters and FLOPs. The intuition is that lookup handles what can simply be recalled, freeing the expensive routed compute for what actually needs reasoning. Tellingly, the gains showed up most in reasoning and code, not raw retrieval, which is exactly where you'd want to reserve computation for thinking rather than remembering.
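As a rough illustration of what an O(1) N-gram lookup could look like, here is a minimal Python sketch. It is not the Engram implementation: the class name `NGramMemory`, the table size, the vector dimension, and the use of Python's built-in `hash` are all assumptions made for illustration. It shows only that fetching a vector for the trailing N-gram costs one hash and one index, however many slots the table holds.

```python
import random


class NGramMemory:
    """Toy hashed N-gram lookup table (hypothetical; not the Engram code)."""

    def __init__(self, n=2, num_slots=1024, dim=4, seed=0):
        rng = random.Random(seed)
        self.n = n
        self.num_slots = num_slots
        # One small vector per slot. In a real model these would be learned;
        # here they are random placeholders.
        self.table = [
            [rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(num_slots)
        ]

    def slot(self, tokens):
        # Hash the trailing n token ids to a slot index: constant-time work
        # that does not depend on num_slots or on the sequence length.
        key = tuple(tokens[-self.n:])
        return hash(key) % self.num_slots

    def lookup(self, tokens):
        # A single list index; no routing, no matrix multiply.
        return self.table[self.slot(tokens)]


memory = NGramMemory()
vec_a = memory.lookup([5, 7, 9])
vec_b = memory.lookup([1, 7, 9])  # same trailing bigram (7, 9)
```

In this sketch, two contexts that end in the same bigram retrieve the same vector, so `vec_a` and `vec_b` are the same table entry. Distinct N-grams can also collide in a hashed table; how a production system handles collisions is outside what the source describes.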

To see why this works, it helps to notice that 'conditional sparsity' isn't one thing. The corpus shows sparsity that the model *learns* on its own: networks grow dense activations for familiar data and stay sparse for unfamiliar inputs (see "Is representational sparsity learned or intrinsic to neural networks?"), and they actively sparsify their hidden states under hard, out-of-distribution tasks as a kind of selective filter (see "Do language models sparsify their activations under difficult tasks?"). So computation already routes itself conditionally based on familiarity. A memory module fits naturally alongside this: it absorbs the familiar so the conditional-compute machinery can concentrate where activations are sparse and the work is genuinely novel.
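To make 'sparse' versus 'dense' activations concrete, one simple proxy is the fraction of entries in a hidden-state vector that are close to zero. The sketch below is an assumption for illustration: the function name `activation_sparsity` and the threshold `eps` are hypothetical, and the cited studies may define sparsity differently.

```python
def activation_sparsity(activations, eps=1e-3):
    """Fraction of entries whose magnitude is below eps (a simple proxy)."""
    if not activations:
        return 0.0
    near_zero = sum(1 for a in activations if abs(a) < eps)
    return near_zero / len(activations)


# Illustrative vectors only, not measurements from any model.
dense_like = [0.4, -0.3, 0.9, 0.2]
sparse_like = [0.0, 0.5, 0.0, -0.2]
```

Under this proxy, `sparse_like` has half of its entries near zero while `dense_like` has none.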

Memory architectures echo the same split. Titans separates short-term attention (quadratic, expensive) from a long-term neural memory that adaptively stores only *surprising* tokens, letting context scale past two million tokens without the quadratic penalty (see "Can neural memory modules scale language models beyond attention limits?"). That 'memorize the surprising, don't recompute the routine' move is the memory-side mirror of conditional computation. And there's a deeper reframing worth knowing: at least one line of work argues the long-context bottleneck was never really memory capacity but the *compute* to fold evicted context into the model's internal state (see "Is long-context bottleneck really about memory or compute?"). If that's right, then memory and computation aren't even cleanly separable: better memory is partly a way of pre-paying compute.
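The 'store only what is surprising' idea can be sketched as a write gate on prediction error. This is a deliberately simplified toy, not the Titans mechanism: the class `SurpriseMemory`, the scalar values, the absolute-error surprise measure, and the fixed `threshold` are all assumptions chosen for illustration.

```python
class SurpriseMemory:
    """Toy surprise-gated memory (hypothetical; far simpler than Titans)."""

    def __init__(self, threshold=0.5, lr=1.0):
        self.threshold = threshold
        self.lr = lr
        self.store = {}   # key -> remembered scalar value
        self.writes = 0   # number of times the memory was updated

    def observe(self, key, value):
        # Surprise is how far the observation is from what memory predicts.
        predicted = self.store.get(key, 0.0)
        surprise = abs(value - predicted)
        if surprise > self.threshold:
            # Only surprising observations pay the cost of a write.
            self.store[key] = predicted + self.lr * (value - predicted)
            self.writes += 1
        return surprise


mem = SurpriseMemory()
mem.observe("a", 1.0)   # unseen key, large error: written
mem.observe("a", 1.1)   # close to the stored value: skipped
mem.observe("a", 3.0)   # large error again: written
```

After these three observations the toy memory has performed two writes; the routine middle observation cost nothing beyond the comparison.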

The lateral lesson the corpus delivers is that sparsity itself is rarely a pure trade-off. Sparse-attention models expand the cost-performance frontier rather than sliding along it: at equal compute, the larger sparse model beats the smaller dense one (see "Does sparse attention trade off quality for speed?"). Conditional memory looks like another Pareto-improving axis of the same kind. But the corpus also flags the failure mode: a single model that tries to *generate* compressed memory instead of looking it up follows a fragile inverted-U, eventually degrading below having no memory at all (see "Can a single model replace retrieval for long-term conversation memory?"). The savings come from cheap, conditional *recall*, not from spending more computation to manufacture memory. That distinction is the whole game: memory reduces computation when it offloads work, and starts costing you the moment it becomes work.

Sources (7 notes)

Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can neural memory modules scale language models beyond attention limits?

The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Does sparse attention trade off quality for speed?

The Sparse Frontier benchmark shows that at equivalent compute cost, larger sparse-attention models outperform smaller dense models on long-context tasks. Sparsity lets you train bigger models within the same budget, making it Pareto-improving rather than a pure trade-off.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows that continuous reprocessing follows an inverted-U curve, eventually degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention · 2.52 match · arXiv
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control · 2.45 match · arXiv
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs · 1.79 match · arXiv
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models · 1.73 match · arXiv
Language Models Need Sleep · 1.69 match · arXiv
Titans: Learning to Memorize at Test Time · 1.68 match · arXiv
MiniMax Sparse Attention · 1.62 match · arXiv
Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations · 0.88 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-evaluating whether conditional memory and conditional sparsity are synergistic levers for reducing computation. A curated library (papers 2024–2026) found the following—treat these as dated claims, not current truth:

**What a curated library found — and when:**
- Conditional memory (O(1) lookup) complements conditional computation (MoE routing); balancing budget between cheap recall and routed inference beats pure sparsity alone, with largest gains in reasoning and code (2026-01, arXiv:2601.07372).
- Sparsity is learned: models densify on familiar data and sparsify on out-of-distribution tasks as an adaptive filter; memory naturally absorbs the familiar, freeing sparse-compute for novel work (2024–2026 findings across arXiv:2603.03415 et al.).
- Titans-style neural memory that adaptively stores only surprising tokens scales context past 2M tokens without quadratic penalty, mirroring the 'memorize the surprising, don't recompute the routine' split (2024-12, arXiv:2501.00663).
- The long-context bottleneck may be the compute cost of folding evicted context back in, not capacity, which reframes memory as pre-paid compute rather than a separate axis (2026 synthesis).
- Compressive memory that *generates* summaries degrades below baseline; savings come only from cheap *lookup*, not expensive memory construction (2024-02, arXiv:2402.11975).

**Anchor papers (verify; mind their dates):**
- arXiv:2601.07372 (2026-01): Conditional Memory via Scalable Lookup.
- arXiv:2501.00663 (2024-12): Titans: Learning to Memorize at Test Time.
- arXiv:2603.03415 (2026-03): Farther the Shift, Sparser the Representation (OOD sparsity).
- arXiv:2504.17768 (2025-04): The Sparse Frontier: Sparse Attention Trade-offs.

**Your task:**
(1) **RE-TEST the complementarity claim.** For each constraint above—lookup efficiency, learned sparsity, surprising-token storage, pre-paid-compute framing, compressive-memory failure—probe whether newer inference orchestration (batching, caching, multi-token speculation), better quantization, or emerging MoE variants have since *merged* these axes or revealed that one fully subsumes the other. Separate what's still open (does lookup remain O(1) under real-world scales?) from what may be resolved.
(2) **Surface strongest contradicting work** from the last ≤6 months: any paper arguing memory and sparsity compete rather than synergize, or showing unified architectures that dissolve the distinction entirely.
(3) **Propose 2 research questions** assuming the regime has shifted: (a) If memory becomes *learned* rather than lookup-based, does it still reduce computation, or does learning cost dominate? (b) Do hybrid sparse–memory architectures trained end-to-end beat modular combinations, and if so, what's the minimal joint objective?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
