INQUIRING LINE

How do transformers compare to state-space models on copying and retrieval?

This explores the architectural showdown between transformers and state-space models (SSMs like Mamba) specifically on copying long strings and pulling facts back out of context — and why the gap exists.


The corpus has a sharp, almost surprising verdict: it isn't close, and there's a provable reason why. Transformers beat SSMs at copying and retrieving from context, and the case rests on a proof as well as on experiments: even a two-layer transformer can copy exponentially long strings, while an SSM is fundamentally bottlenecked by its fixed-size latent state (see "Can state-space models match transformers at copying and retrieval?"). An SSM compresses everything it has seen into one rolling vector of bounded size; once the content to be retrieved exceeds what that vector can hold, it can no longer be recovered exactly. A transformer, by contrast, keeps every past token addressable through attention, so it can reach back and grab the exact span it needs. Copying is the cleanest possible stress test of that difference, which is why it shows up as the decisive benchmark.
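The fixed-state bottleneck can be made concrete with a counting argument. The sketch below is not the proof from the cited note; it is a simple pigeonhole bound, assuming the state holds at most `state_bits` bits of distinguishable information and that every string of a given length must be copied exactly. The function names and parameters are illustrative, not taken from any library.

```python
def min_state_bits(length, vocab_size):
    """Smallest b such that 2**b >= vocab_size**length.

    Assumption: to copy every possible string exactly, the state must
    take a distinct value for each of the vocab_size**length strings.
    """
    n_strings = vocab_size ** length
    return (n_strings - 1).bit_length()


def max_copy_length(state_bits, vocab_size):
    """Longest length n such that all vocab_size**n strings fit in 2**state_bits states."""
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    n = 0
    # Grow n while one more token still fits within the available states.
    while vocab_size ** (n + 1) <= 2 ** state_bits:
        n += 1
    return n


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
bits_needed = min_state_bits(length=1000, vocab_size=256)
longest = max_copy_length(state_bits=8000, vocab_size=256)
```

Under these assumptions, copying arbitrary 1000-token strings over a 256-symbol vocabulary needs at least 8000 bits of state, and a state of 8000 bits cannot guarantee exact copies of anything longer. The bound grows linearly with string length, whereas a fixed-size state does not grow at all.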

The interesting turn is that this transformer 'win' is the same property that makes its knowledge slippery in other settings. One note argues transformers carry knowledge as a continuous flow through the residual stream rather than as items in fixed storage, closer to oral performance than to a database lookup (see "Do transformer models store knowledge or generate it continuously?"). So the retrieval advantage over SSMs is specifically about retrieving from the *active context window*, not about retrieving stored facts from weights. Attention gives you a kind of working-memory random access that the SSM's compressed state can't match; neither architecture turns parametric knowledge into a clean archive.

That reframes the whole comparison as a question about where you pay for memory. The fixed-state bottleneck that hurts SSMs at copying is mirrored by research showing the long-context bottleneck for transformers is really about *compute*, meaning the cost of consolidating evicted context into internal state, rather than raw memory capacity (see "Is long-context bottleneck really about memory or compute?"). Both architectures eventually have to compress; they just differ on when and how aggressively. Related work on replacing retrieval with a single model that continuously regenerates a compressed memory shows the danger of over-compression directly: performance follows an inverted-U and can drop below the no-memory baseline when consolidation misgroups or loses context (see "Can a single model replace retrieval for long-term conversation memory?"). That's essentially the SSM failure mode generalized: squeeze context into a small state and exact retrieval degrades.

There's a deeper structural echo here too. Transformers' lack of a native recurrent state is exactly why they lean on explicit chain-of-thought: with no place to keep evolving state internally, they push it deeper through layers until depth runs out, then externalize it into tokens (see "Why do transformers need explicit chain-of-thought reasoning?"). SSMs *do* have recurrent state, which is what should make them elegant, yet that same compact state is what costs them at copying. So the two architectures sit on opposite horns: SSMs have recurrence but can't randomly access the past; transformers can randomly access the past but have no persistent recurrent register. Copying-and-retrieval is just the task that exposes the transformer's edge most starkly.
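The contrast between random access and a bounded recurrent register can be shown with a deliberately crude toy. This is an assumption-laden illustration, not a model of attention or of any real SSM: a growing list stands in for per-token addressability, and a fixed-length buffer stands in for a fixed-size state. A real SSM compresses lossily rather than dropping the oldest tokens, so only the qualitative point carries over. All names here are hypothetical.

```python
from collections import deque


def copy_with_full_context(tokens):
    """Stand-in for attention: every past token stays addressable by position."""
    context = list(tokens)  # storage grows with the input length
    # Copy by looking up each position directly (random access).
    return [context[i] for i in range(len(context))]


def copy_with_bounded_state(tokens, state_size):
    """Stand-in for a fixed-size recurrent state with `state_size` slots."""
    state = deque(maxlen=state_size)  # oldest entries are dropped when full
    for tok in tokens:  # one state update per incoming token
        state.append(tok)
    # Only what survived in the state can be reproduced.
    return list(state)


seq = list("abcdefghij")
full_copy = copy_with_full_context(seq)
bounded_copy = copy_with_bounded_state(seq, state_size=4)
```

Here `full_copy` reproduces all ten tokens, while `bounded_copy` holds only the last four. The bounded version succeeds whenever the sequence fits in the state and fails as soon as it does not, which is the shape of the copying gap described above.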

If you want to push further, it's worth noticing that retrieval failures aren't only architectural in this narrow sense. Separate work shows even bolt-on retrieval systems (RAG) fail at structural levels such as embedding-dimension limits and the mismatch between semantic association and task relevance (see "Where do retrieval systems fail and why?"), and that hierarchical designs separating query planning from synthesis outperform flat ones on multi-hop retrieval (see "Do hierarchical retrieval architectures outperform flat ones on complex queries?"). The throughline across all of it: retrieval quality tracks how much addressable structure you preserve versus how aggressively you compress, and on that axis, attention's per-token addressability is the thing SSMs gave up for efficiency.


Sources (7 notes)

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below the no-memory baseline due to misgrouping, context loss, and overfitting.

Why do transformers need explicit chain-of-thought reasoning?

Feedforward transformers lack native recurrent state-tracking and must push evolving state deeper into layers, eventually exhausting depth. Explicit chain-of-thought externalizes this state into tokens as a costly patch for a structural deficiency.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Do hierarchical retrieval architectures outperform flat ones on complex queries?

Separating query planning from answer synthesis into distinct components reduces interference and improves multi-hop query performance. This architectural principle mirrors documented benefits of separating planning from execution in agent design.
