How do fixed recurrent states trade off copying accuracy for filtering ability?
This explores a structural tension in models that compress everything into a fixed-size memory (state-space models, RNNs, neural memory modules): the same bottleneck that forces them to *forget and filter* is what stops them from *copying verbatim*.
The cleanest statement of the cost side comes from a proof that two-layer transformers can copy exponentially long strings while state-space models cannot: because an SSM has to cram the whole past into a latent vector of bounded size, it provably loses the ability to retrieve arbitrary earlier tokens (see: Can state-space models match transformers at copying and retrieval?). Copying is the worst case for a fixed state: it demands that you preserve *everything*, which is exactly what a bottleneck refuses to do.
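To make the capacity limit concrete, here is a minimal sketch. It assumes a toy outer-product (linear associative) memory as a stand-in for "a fixed-size state"; it is not an SSM and not the construction used in the cited proof. The function name `recall_accuracy` and its parameters are hypothetical. It stores `num_items` key/value pairs in a single `dim x dim` matrix and counts how many values come back exactly.

```python
import numpy as np

def recall_accuracy(num_items: int, dim: int, seed: int = 0) -> float:
    """Fraction of stored values recovered *exactly* from a fixed dim x dim state.

    Toy assumption: the state is a sum of outer products v_i k_i^T, and
    reading with key k_j returns sign(state @ k_j).
    """
    rng = np.random.default_rng(seed)
    # Random keys with roughly unit norm, so the signal term is about 1.
    keys = rng.standard_normal((num_items, dim)) / np.sqrt(dim)
    # The "tokens" we would like to copy back: random +/-1 vectors.
    values = rng.choice([-1.0, 1.0], size=(num_items, dim))

    # Write everything into one fixed-size matrix, shape (dim, dim).
    state = values.T @ keys

    # Read every item back; row j of the product is state @ keys[j].
    readout = np.sign(keys @ state.T)

    # Copying is all-or-nothing: every coordinate of an item must match.
    return float(np.mean(np.all(readout == values, axis=1)))

if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        print(n, recall_accuracy(n, dim=64))
```

Under these assumptions, exact recall should hold while the number of stored items is small relative to `dim` and collapse once it is much larger, because cross-talk between keys grows with every write. A cache that simply appends each value would recall perfectly at any length, but its storage grows with the sequence.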
But flip the framing and the bottleneck becomes a feature. Filtering, meaning deciding what is worth keeping, is the whole point of a compressed state, and recent work suggests models do this adaptively rather than uniformly. LLM hidden states sparsify under out-of-distribution or hard inputs, and that sparsification looks like a deliberate selective filter that stabilizes performance, not a failure mode (see: Do language models sparsify their activations under difficult tasks?). The complementary finding is that this filtering behavior is *learned*: networks build dense representations for familiar data and fall back to sparse ones for the unfamiliar, a kind of consolidation through exposure (see: Is representational sparsity learned or intrinsic to neural networks?). So the fixed state isn't just lossy; it's lossy in a shaped, information-prioritizing way.
The interesting architectural moves try to refuse the trade-off rather than accept it. Titans bolts a long-term neural memory module onto attention and explicitly prioritizes *surprising* tokens for storage. Attention handles exact short-range copying while the compressed memory keeps a filtered digest of the far past, scaling past 2M tokens without quadratic cost (see: Can neural memory modules scale language models beyond attention limits?). That's the trade-off made into a division of labor: precise-but-expensive retrieval for what's near, lossy-but-cheap filtering for what's far. Hierarchical recurrence does something adjacent in the depth dimension, coupling a slow planning loop with a fast computation loop so that a small recurrent model reaches reasoning that fixed-depth transformers can't (see: Can recurrent hierarchies achieve reasoning that transformers cannot?).
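The idea of prioritizing surprising tokens for storage can be sketched with a toy memory whose write size scales with its own prediction error. This is an illustration under stated assumptions, not the Titans implementation: the class `SurpriseMemory`, the linear memory matrix, the squared-error objective, and the `lr` and `decay` parameters are all hypothetical choices made for the example.

```python
import numpy as np

class SurpriseMemory:
    """Toy fixed-size memory M, updated by gradient steps on 0.5 * ||M k - v||^2.

    Assumption for illustration: "surprise" is the norm of the prediction
    error, so tokens the memory already predicts well barely change it.
    """

    def __init__(self, dim: int, lr: float = 0.5, decay: float = 0.0):
        self.M = np.zeros((dim, dim))
        self.lr = lr        # step size of each write
        self.decay = decay  # optional forgetting applied on every write

    def surprise(self, key: np.ndarray, value: np.ndarray) -> float:
        """How badly the current memory predicts `value` from `key`."""
        return float(np.linalg.norm(self.M @ key - value))

    def write(self, key: np.ndarray, value: np.ndarray) -> float:
        """Store (key, value); returns the surprise measured before the write."""
        err = self.M @ key - value
        grad = np.outer(err, key)  # gradient of 0.5 * ||M k - v||^2 w.r.t. M
        self.M = (1.0 - self.decay) * self.M - self.lr * grad
        return float(np.linalg.norm(err))

if __name__ == "__main__":
    mem = SurpriseMemory(dim=4)
    k = np.array([1.0, 0.0, 0.0, 0.0])
    v = np.array([0.0, 1.0, 0.0, 0.0])
    print(mem.write(k, v))  # first exposure: large error, large update
    print(mem.write(k, v))  # repeat: smaller error, smaller update
```

Because the update is proportional to the error, a repeated token moves the memory less than a novel one, which is the filtering behavior in miniature. The memory stays the same size no matter how many tokens pass through it.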
There's a deeper reframe lurking here that is worth a detour. One line of work argues transformers don't really *store* knowledge in retrievable slots at all. Instead, the residual stream transmits knowledge as continuous flow, more like an oral performance than a written archive, which is precisely why model knowledge is contextual and hard to edit (see: Do transformer models store knowledge or generate it continuously?). If that's right, then "copying accuracy vs. filtering" isn't an SSM-specific defect; it's a sharper version of a tension every neural sequence model lives inside. The fixed recurrent state just makes the cost legible by putting a hard wall on how much can flow through.
The takeaway a curious reader might not have expected: copying and filtering aren't two tasks a model happens to be good or bad at. They're the two ends of a single dial set by how much state you allow. Give the model unbounded retrievable context (attention) and it copies perfectly but pays a cost quadratic in sequence length; squeeze it into a fixed vector and it filters elegantly but cannot quote the input back to you. The frontier work isn't choosing a point on that dial; it's building two memories so you don't have to.
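The dial can be stated as a simple counting bound. The following is a sketch by pigeonhole, not the argument of the cited proof. The symbols are assumptions introduced here: $b$ is the number of bits the state can hold, $n$ is the length of the string to copy, and $\Sigma$ is the token alphabet. The model is assumed to reproduce the string from its state alone after reading it.

```latex
% A state of b bits takes at most 2^b distinct values. To copy every
% length-n string over \Sigma exactly, distinct strings need distinct states:
\[
  2^{b} \;\ge\; |\Sigma|^{n}
  \quad\Longleftrightarrow\quad
  b \;\ge\; n \log_2 |\Sigma| .
\]
% Fixed state: b is constant, so exact copying of arbitrary strings
% must fail once n > b / \log_2 |\Sigma|.
% Attention: the cache keeps all n tokens, so b grows linearly with n
% and the bound is met, at the price of O(n^2) pairwise comparisons.
```

Read this way, filtering is what a model has to do when $b$ is held below $n \log_2 |\Sigma|$: it cannot keep everything, so it has to choose what to keep.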
Sources (6 notes)
Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.