Can fast-slow separation improve both memory and generation in language models?
This explores whether splitting a model into a 'fast' channel (one that adapts or remembers quickly) and a 'slow' channel (one that holds stable, general knowledge) helps on two fronts at once: holding onto information over long contexts, and producing better output. The corpus suggests the same fast-slow split keeps reappearing as a general design principle, not a one-off trick.
The corpus says yes, and what's striking is that the same idea surfaces independently across memory architectures, training schemes, and even theories of where the bottleneck actually lives. The clearest statement of the split is Titans ("Can neural memory modules scale language models beyond attention limits?"), which pairs quadratic attention over the recent past with a separate long-term neural memory that compresses and stores the *surprising* tokens, letting context stretch past two million tokens without the quadratic penalty. The two channels do different jobs precisely because one is fast and local and the other is slow and consolidated.
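To make the idea of storing only the surprising tokens concrete, here is a toy sketch in Python. It is not the Titans implementation: the `SurpriseMemory` class, its `lr` and `threshold` parameters, and the choice of squared prediction error as the surprise signal are all assumptions made for illustration. The memory is a small linear map that is updated only when it predicts an incoming key-value pair badly.

```python
from typing import List

Vector = List[float]


class SurpriseMemory:
    """Toy linear associative memory (hypothetical): predicts value = M @ key."""

    def __init__(self, dim: int, lr: float = 0.5, threshold: float = 0.1) -> None:
        self.dim = dim
        self.lr = lr                # step size for each write (assumed)
        self.threshold = threshold  # minimum surprise that triggers a write (assumed)
        self.M = [[0.0] * dim for _ in range(dim)]
        self.writes = 0

    def read(self, key: Vector) -> Vector:
        # Matrix-vector product: the memory's current guess for this key.
        return [sum(row[j] * key[j] for j in range(self.dim)) for row in self.M]

    def surprise(self, key: Vector, value: Vector) -> float:
        # Squared prediction error stands in for "surprise" in this sketch.
        pred = self.read(key)
        return sum((p - v) ** 2 for p, v in zip(pred, value))

    def observe(self, key: Vector, value: Vector) -> bool:
        # Skip tokens the memory already predicts well; store the rest.
        if self.surprise(key, value) <= self.threshold:
            return False
        pred = self.read(key)
        for i in range(self.dim):
            err = value[i] - pred[i]
            for j in range(self.dim):
                # Delta-rule update: move the prediction toward the target.
                self.M[i][j] += self.lr * err * key[j]
        self.writes += 1
        return True


memory = SurpriseMemory(dim=2, lr=1.0)
first = memory.observe([1.0, 0.0], [0.0, 1.0])   # novel pair, gets written
second = memory.observe([1.0, 0.0], [0.0, 1.0])  # already known, gets skipped
```

In this sketch a repeated token costs nothing after its first write, which is the sense in which the slow store spends its capacity only on what the fast channel could not already predict.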
On the training side, the same split shows up as a way to stop models from forgetting. Fast-Slow Training ("Can splitting adaptation into two channels reduce forgetting?") routes task-specific lessons into fast textual prompts while barely touching the slow weights, reaching equivalent performance 1.4 to 3 times faster with far less catastrophic forgetting. Its punchline reframes the whole problem: forgetting isn't an unavoidable tax, it's a *misallocation*. You forget when you write transient lessons into the slow store, where they don't belong. Latent-Thought models ("Can latent thought vectors scale language models beyond parameters?") make the generation side of this explicit, coupling fast local variational learning with slow global decoder learning, and that dual-rate scheme opens up scaling dimensions independent of raw parameter count.
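The misallocation argument can be illustrated with a deliberately tiny numeric analogy in Python. This is an assumption-laden sketch, not the Fast-Slow Training method: a single scalar `fast` offset stands in for the textual prompt, a single scalar `w` stands in for the slow weights, and the task data, learning rates, and the function name `train_on_task_b` are all hypothetical. An earlier task (y = 2x) has already set `w = 2.0`; the new task (y = 2x + 3) differs only by an offset.

```python
from typing import List, Tuple

# Task B data: y = 2x + 3. Task A, learned earlier, was y = 2x, so the
# slow weight starts at w = 2.0 and only the offset of 3 is new.
TASK_B: List[Tuple[float, float]] = [(1.0, 5.0), (2.0, 7.0), (3.0, 9.0)]


def train_on_task_b(slow_lr: float, fast_lr: float, epochs: int = 200) -> Tuple[float, float]:
    """Per-sample SGD on squared error for the model y_hat = w * x + fast."""
    w = 2.0     # slow channel: shared knowledge carried over from task A
    fast = 0.0  # fast channel: task-specific offset (stand-in for a prompt)
    for _ in range(epochs):
        for x, y in TASK_B:
            err = (w * x + fast) - y
            w -= slow_lr * 2.0 * err * x   # slow update
            fast -= fast_lr * 2.0 * err    # fast update
    return w, fast


# Split allocation: the new lesson lands almost entirely in the fast channel.
w_split, fast_split = train_on_task_b(slow_lr=0.001, fast_lr=0.1)

# Misallocation: with no fast channel, the lesson is forced into the slow weight.
w_slow_only, _ = train_on_task_b(slow_lr=0.01, fast_lr=0.0)

# Drift of the slow weight away from what task A needs (w = 2.0).
drift_split = abs(w_split - 2.0)
drift_slow_only = abs(w_slow_only - 2.0)
```

Under these assumptions the slow weight stays close to 2.0 when the fast channel is available and moves well away from it when it is not, so task A predictions survive in the first case and degrade in the second.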
The most provocative reframing is that the long-context 'memory' problem may not be about memory at all. Research on the long-context bottleneck ("Is long-context bottleneck really about memory or compute?") argues the real cost is the *compute* needed to consolidate evicted context into fast weights during offline 'sleep' phases, and that additional consolidation passes keep improving performance, following a test-time scaling pattern on harder reasoning tasks. So the fast-slow split isn't just storage hygiene; it's where the model does its thinking between the moment it sees something and the moment it needs to use it.
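The claim that extra consolidation passes buy better recall can be sketched with a toy in Python. Everything here is an assumption for illustration rather than the cited method: the evicted context is two hand-picked key-value pairs in `EVICTED`, the fast weights are a 2x2 matrix, and each 'sleep' pass is one delta-rule sweep with a hypothetical step size `lr`. Because the two keys overlap, a single pass cannot store both pairs cleanly, and further passes keep reducing the recall error.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]

# Evicted context as (key, value) pairs. The keys are unit-length but not
# orthogonal, so writing one pair partly disturbs the other.
EVICTED: List[Pair] = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([0.6, 0.8], [0.0, 1.0]),
]


def recall(weights: List[Vector], key: Vector) -> Vector:
    # Fast-weight readout: weights @ key.
    return [sum(w * k for w, k in zip(row, key)) for row in weights]


def recall_error(weights: List[Vector], pairs: List[Pair]) -> float:
    # Total squared error when reading every stored key back.
    total = 0.0
    for key, value in pairs:
        pred = recall(weights, key)
        total += sum((p - v) ** 2 for p, v in zip(pred, value))
    return total


def consolidation_pass(weights: List[Vector], pairs: List[Pair], lr: float = 0.5) -> None:
    # One offline sweep: a delta-rule update per pair, applied in place.
    for key, value in pairs:
        pred = recall(weights, key)
        for i, row in enumerate(weights):
            err = value[i] - pred[i]
            for j in range(len(row)):
                row[j] += lr * err * key[j]


def error_after(passes: int) -> float:
    weights = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(passes):
        consolidation_pass(weights, EVICTED)
    return recall_error(weights, EVICTED)


errors: Dict[int, float] = {n: error_after(n) for n in (0, 1, 10)}
```

In this toy the number of passes is a pure compute knob: the stored data and the size of the fast weights never change, yet recall error keeps falling as more passes are spent.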
But the corpus also flags where the split goes wrong. COMEDY ("Can a single model replace retrieval for long-term conversation memory?") collapses memory generation, compression, and response into a single fast operation, with no retrieval database, and it works until continuous reprocessing tips into an inverted-U: past a certain point, consolidation degrades performance *below* having no memory at all, through misgrouping, context loss, and overfitting. The lesson is that fast-slow separation helps only when the boundary is respected; merge the channels too aggressively and you get the worst of both.
What you might not have expected: this whole family of ideas rhymes with a claim about what transformers fundamentally are. The residual-stream view ("Do transformer models store knowledge or generate it continuously?") argues knowledge in these models is *flow*, not storage, closer to oral performance than to a library. If that's right, then bolting on an explicit slow memory store isn't a minor add-on; it supplies the durable archive the architecture never natively had, which is exactly why separating fast generation from slow memory improves both at once.
Sources (6 notes)
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below no-memory baseline due to misgrouping, context loss, and overfitting.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.