SYNTHESIS NOTE

Can state-space models match transformers at copying and retrieval?

Explores whether the efficiency gains of state-space models come at a fundamental cost in their ability to copy strings and retrieve exact information from context, compared to transformers.

Synthesis note · 2026-06-03 · sourced from Novel Architectures

The efficiency case for generalized state-space models (GSSMs: S4, Mamba, linear attention, parallel RNNs) is that they use an O(1) fixed-size latent state instead of the transformer's Ω(L) memory, where L is the sequence length. This paper asks what that efficiency costs, and proves a sharp limit: a two-layer transformer can copy strings of exponential length, while GSSMs are fundamentally bounded by their fixed-size state. Empirically, transformers beat GSSMs at copying and context-retrieval on synthetic tasks, and pretrained transformer LLMs dramatically outperform state-space LLMs at copying and retrieving information from context.
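The fixed-state limit can be read as a counting argument. The sketch below is an illustration under stated assumptions, not the paper's exact theorem: the symbols b (number of bits in the latent state) and Σ (the token alphabet) are introduced here and do not appear in the note.

```latex
% Assumption: the latent state is one of at most 2^b distinct values.
% To copy every string in \Sigma^L, the map from input to state must be
% injective, so the number of strings cannot exceed the number of states:
\[
  |\Sigma|^{L} \;\le\; 2^{b}
  \quad\Longrightarrow\quad
  L \;\le\; \frac{b}{\log_2 |\Sigma|}.
\]
% A memory that keeps all L tokens needs at least L \log_2 |\Sigma| bits,
% which is the \Omega(L) cost the transformer pays.
```

Under these assumptions, any string longer than b / log2|Σ| tokens must share its state with some other string, so at least one of the two cannot be reproduced exactly.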

The key takeaway is the mechanism-level trade-off: a fixed-size memory cannot losslessly hold arbitrary context, so any task that requires reproducing or retrieving from the input verbatim has a hard ceiling for GSSMs that transformers don't face. This is the precise capability cost of the efficiency that makes linear-attention architectures attractive. The authors' constructive suggestion, hybrid architectures that give SSMs an attention-like retrieval mechanism, is now the dominant design response.
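The trade-off above can be made concrete with a toy experiment. This is a minimal sketch, not the paper's construction or any real SSM: `fixed_state_encode` is a hypothetical recurrent update that folds tokens into a state of `state_bits` bits, and `growing_cache_encode` is a stand-in for an attention-style memory that keeps every token. All function names and parameters are assumptions made for illustration.

```python
from itertools import product


def max_lossless_length(state_bits: int, vocab_size: int) -> int:
    """Longest n with vocab_size**n <= 2**state_bits (pigeonhole bound)."""
    n = 0
    while vocab_size ** (n + 1) <= 2 ** state_bits:
        n += 1
    return n


def fixed_state_encode(tokens, state_bits: int, vocab_size: int) -> int:
    """Hypothetical recurrent update: the state never exceeds state_bits bits."""
    state = 0
    for tok in tokens:
        state = (state * vocab_size + tok) % (2 ** state_bits)
    return state


def growing_cache_encode(tokens) -> tuple:
    """Attention-like memory: keep every token, so memory grows with length."""
    return tuple(tokens)


def count_distinct(encode, length: int, vocab_size: int) -> int:
    """Number of distinct memories over all strings of the given length."""
    strings = product(range(vocab_size), repeat=length)
    return len({encode(s) for s in strings})


if __name__ == "__main__":
    bits, vocab = 8, 2
    print("lossless up to length", max_lossless_length(bits, vocab))
    for length in (8, 10):
        fixed = count_distinct(
            lambda s: fixed_state_encode(s, bits, vocab), length, vocab
        )
        cache = count_distinct(growing_cache_encode, length, vocab)
        print(length, "tokens:", fixed, "fixed states vs", cache, "cache entries")
```

With an 8-bit state and a binary alphabet, all 256 strings of length 8 get distinct states, but the 1024 strings of length 10 collapse onto the same 256 states, so most of them cannot be copied back exactly. The growing cache keeps all 1024 apart, at the price of memory that scales with length.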

This grounds the efficiency-vs-capability tension in the vault's architecture thread. It is the cautionary counterweight to the note "Can spiking neurons make transformers efficient on any hardware?": linear/spiking attention buys efficiency, but this proof says the fixed state pays for it in copying and retrieval, which is why SpikingBrain and others use hybrid-linear rather than pure-linear attention.

Original note title

transformers provably beat state-space models at copying and retrieving from context because a fixed-size latent state cannot