SYNTHESIS NOTE

Can state-space models match transformers at copying and retrieval?

Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.

Synthesis note · 2026-06-03 · sourced from Novel Architectures

The efficiency case for generalized state-space models (GSSMs: S4, Mamba, linear attention, parallel RNNs) is that they use an O(1) fixed-size latent state instead of the transformer's Ω(L) memory, where L is the sequence length. This paper asks what that efficiency costs, and proves a sharp limit: a two-layer transformer can copy strings of exponential length, while GSSMs are fundamentally bounded by their fixed-size state. Empirically, transformers beat GSSMs at copying and context-retrieval on synthetic tasks, and pretrained transformer LLMs dramatically outperform state-space LLMs at copying and retrieving information from context.
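
To make the O(1) versus Ω(L) contrast concrete, here is a rough Python sketch that counts the numbers each design must keep around while generating. The formulas and sizes (n_layers, d_model, d_state) are illustrative assumptions, not figures from the paper: the transformer side assumes a standard key/value cache with one key vector and one value vector per token per layer, and the GSSM side assumes one d_model × d_state recurrent state per layer.

```python
def kv_cache_entries(seq_len: int, n_layers: int, d_model: int) -> int:
    """Stored numbers in an assumed standard KV cache: grows linearly with seq_len."""
    # One key vector and one value vector (d_model each) per token per layer.
    return 2 * n_layers * seq_len * d_model


def fixed_state_entries(n_layers: int, d_model: int, d_state: int) -> int:
    """Stored numbers in an assumed fixed recurrent state: independent of seq_len."""
    # One (d_model x d_state) state matrix per layer, overwritten at every step.
    return n_layers * d_model * d_state


if __name__ == "__main__":
    # Hypothetical model sizes, chosen only for illustration.
    n_layers, d_model, d_state = 2, 64, 16
    for seq_len in (10, 1_000, 100_000):
        cache = kv_cache_entries(seq_len, n_layers, d_model)
        state = fixed_state_entries(n_layers, d_model, d_state)
        print(f"L={seq_len:>7}: KV cache={cache:>12,}  fixed state={state:,}")
```

Under these assumptions the cache scales with L while the recurrent state stays constant, which is both the efficiency gain and the source of the limit discussed in this note.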

The keeper is the mechanism-level trade-off: a fixed-size memory cannot losslessly hold arbitrary context, so any task that requires reproducing or retrieving from the input verbatim has a hard ceiling for GSSMs that transformers don't face. This is the precise capability cost of the efficiency that makes linear-attention architectures attractive. The authors' constructive suggestion — hybrid architectures that give SSMs an attention-like retrieval mechanism — is now the dominant design response.
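
The ceiling follows from a counting argument, sketched below under simplifying assumptions: the model is deterministic, its state is treated as exactly state_bits bits (real states are finite-precision floats), and it must reproduce the input only after reading all of it. A state of b bits can take at most 2^b values, so it can distinguish at most 2^b inputs. There are V^L strings of length L over a vocabulary of size V, so lossless copying requires V^L ≤ 2^b. The function name and the example sizes are hypothetical.

```python
def max_lossless_copy_length(state_bits: int, vocab_size: int) -> int:
    """Largest L with vocab_size**L <= 2**state_bits (pigeonhole bound)."""
    if state_bits < 0 or vocab_size < 2:
        raise ValueError("need state_bits >= 0 and vocab_size >= 2")
    capacity = 1 << state_bits  # number of distinct values the state can take
    length = 0
    n_strings = 1  # number of distinct strings of the current length
    # Grow the string length while every string still maps to its own state.
    while n_strings * vocab_size <= capacity:
        n_strings *= vocab_size
        length += 1
    return length


if __name__ == "__main__":
    # Hypothetical sizes: a 1024-bit state with binary and byte-level vocabularies.
    for vocab_size in (2, 256):
        limit = max_lossless_copy_length(1024, vocab_size)
        print(f"vocab={vocab_size}: at most {limit} tokens copied losslessly")
```

The bound is linear in the state size, so doubling the state only doubles the copyable length. A model that can attend over the whole context is not subject to this particular bottleneck.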

This grounds the efficiency-vs-capability tension in the vault's architecture thread. It is the cautionary counterweight to Can spiking neurons make transformers efficient on any hardware? — linear/spiking attention buys efficiency, but this proof says the fixed state pays for it in copying and retrieval, which is why SpikingBrain and others use hybrid-linear rather than pure-linear attention.

Inquiring lines that read this note 25

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

What role does compression play in language model capability and generalization?

What memory architectures best support persistent reasoning across extended interactions?

How should retrieval systems optimize for multi-step reasoning during inference?

Does externalizing cognitive work and state improve agent reliability?

Can externalizing bookkeeping to a stateful harness replace internalized memory control?

How does reasoning graph topology affect breakthrough insights and generalization?

How does structured environment state compare to transcript replay for multi-turn reasoning?

How do transformer attention mechanisms implement memory and algorithmic functions?

How does sequence length affect sparsity tolerance in models?

What are the concrete efficiency gains of linear-attention state-space models?

Can next-token prediction alone produce genuine language understanding?

Does latent manipulation outperform token-level prediction for efficiency?

Which computational strategies best support reasoning in language models?

Can a trained decoder replace both search and parameter updates?

Why do semantic similarity and task relevance diverge in vector embeddings?

Why do encoder models process document corpora more efficiently than decoder models?

Related concepts in this collection 2

Can spiking neurons make transformers efficient on any hardware? Explores whether brain-inspired spiking mechanisms combined with linear attention can adapt existing transformer checkpoints into efficient models trainable outside NVIDIA ecosystems using minimal additional data.
the efficiency play this proof bounds; explains why hybrid-linear beats pure-linear
Can recurrent memory scale where attention fails on ultra-long text? GPT-4 and RAG plateau around 10,000 tokens and rely heavily on the first quarter of input. Can recurrent memory augmentation overcome these limits and enable reasoning across millions of tokens?
counterpoint on the long-context axis: recurrent memory can win where attention degrades, but at a different task profile
