INQUIRING LINE

The same fixed memory that helps AI tune out noise makes it incapable of copying long passages word for word.

How do fixed recurrent states trade off copying accuracy for filtering ability?

This line explores a structural tension in models that compress everything into a fixed-size memory (state-space models, RNNs, neural memory modules): the same bottleneck that forces them to *forget and filter* is what stops them from *copying verbatim*.

The cleanest statement of the cost side comes from a proof that two-layer transformers can copy strings whose length is exponential in the model's size, while state-space models cannot: because an SSM has to cram the whole past into a latent state of bounded size, it provably loses the ability to retrieve arbitrary earlier tokens once the input outgrows that state [Can state-space models match transformers at copying and retrieval?]. Copying is the worst case for a fixed state: it demands that you preserve *everything*, which is exactly what a bottleneck refuses to do.
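The capacity limit described above can be illustrated with a simple counting argument. This is a minimal sketch and not the proof from the cited paper: it assumes the recurrent state holds `state_bits` bits in total (a hypothetical parameter standing in for state dimension times numeric precision) and that every possible string over the vocabulary must be reproduced exactly. The function name `max_copyable_length` is illustrative.

```python
def max_copyable_length(state_bits: int, vocab_size: int) -> int:
    """Largest n with vocab_size**n <= 2**state_bits.

    Assumption (not from the source): the whole recurrent state is
    `state_bits` bits. Such a state takes at most 2**state_bits distinct
    values, so it can tell apart at most that many input strings.
    """
    if state_bits < 0 or vocab_size < 2:
        raise ValueError("need state_bits >= 0 and vocab_size >= 2")
    capacity = 1 << state_bits      # number of distinct state values
    n, strings = 0, 1               # strings == vocab_size ** n
    while strings * vocab_size <= capacity:
        strings *= vocab_size
        n += 1
    return n


if __name__ == "__main__":
    for bits in (16, 32, 64):
        print(bits, "bits ->", max_copyable_length(bits, vocab_size=4), "tokens")
```

Under these assumptions a 16-bit state over a 4-symbol vocabulary tops out at 8 tokens, and doubling the state only doubles that length. The bound grows linearly with state size, which is why any fixed budget is eventually outrun by a long enough input.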

But flip the framing and the bottleneck becomes a feature. Filtering, meaning deciding what's worth keeping, is the whole point of a compressed state, and recent work suggests models do this adaptively rather than uniformly. Hidden states sparsify under out-of-distribution or hard inputs, and that sparsification looks like a deliberate selective filter that stabilizes performance rather than a failure [Do language models sparsify their activations under difficult tasks?]. The complementary finding is that this filtering behavior is *learned*: networks build dense representations for familiar data and fall back to sparse ones for the unfamiliar, a kind of consolidation through exposure [Is representational sparsity learned or intrinsic to neural networks?]. So the fixed state isn't just lossy; it's lossy in a shaped, information-prioritizing way.

The interesting architectural moves try to refuse the trade-off rather than accept it. Titans bolts a long-term neural memory module onto attention and explicitly prioritizes *surprising* tokens for storage, letting attention handle exact short-range copying while the compressed memory keeps a filtered digest of the far past, scaling past 2M tokens without quadratic cost [Can neural memory modules scale language models beyond attention limits?]. That's the trade-off made into a division of labor: precise-but-expensive retrieval for what's near, lossy-but-cheap filtering for what's far. Hierarchical recurrence does something adjacent in the depth dimension, coupling a slow planning loop with a fast computation loop so that a small recurrent model reaches reasoning that fixed-depth transformers can't [Can recurrent hierarchies achieve reasoning that transformers cannot?].
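The division of labor described above can be sketched as a toy two-tier memory. This is not the Titans implementation: the class `TwoTierMemory`, its parameters `window` and `slots`, and the use of smoothed unigram surprisal as the "surprise" score are all assumptions made for illustration. The sketch only shows the shape of the idea, an exact recent window plus a fixed number of far-past slots reserved for the most surprising tokens.

```python
import heapq
import math
from collections import Counter, deque


class TwoTierMemory:
    """Toy memory: exact recent window + fixed-size surprise-ranked store."""

    def __init__(self, window: int, slots: int):
        if window < 1 or slots < 1:
            raise ValueError("window and slots must be >= 1")
        self.recent = deque(maxlen=window)  # exact, short-range tier
        self.far = []                       # min-heap of (surprise, position, token)
        self.slots = slots
        self.counts = Counter()             # running unigram statistics
        self.total = 0
        self.position = 0

    def _surprise(self, token) -> float:
        # Laplace-smoothed unigram surprisal in bits (an illustrative stand-in
        # for a learned surprise signal); one extra slot covers unseen tokens.
        vocab = len(self.counts) + 1
        p = (self.counts[token] + 1) / (self.total + vocab)
        return -math.log2(p)

    def write(self, token) -> None:
        entry = (self._surprise(token), self.position, token)
        self.recent.append(token)           # oldest token drops out automatically
        if len(self.far) < self.slots:
            heapq.heappush(self.far, entry)
        elif entry > self.far[0]:            # more surprising than the weakest slot
            heapq.heapreplace(self.far, entry)
        self.counts[token] += 1
        self.total += 1
        self.position += 1

    def recent_tokens(self) -> list:
        return list(self.recent)

    def far_tokens(self) -> list:
        # Kept tokens in their original order of appearance.
        return [tok for _, _, tok in sorted(self.far, key=lambda e: e[1])]


if __name__ == "__main__":
    mem = TwoTierMemory(window=4, slots=2)
    for tok in "a" * 25 + "Z" + "a" * 24:
        mem.write(tok)
    print("recent:", mem.recent_tokens())
    print("far:", mem.far_tokens())
```

In this toy run the rare token `Z` survives in the far store long after it has left the exact window, while most of the repetitive `a` tokens are discarded. Neither tier alone can reproduce the full 50-token stream, which is the filtering half of the trade-off in miniature.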

There's a deeper reframe lurking here that is worth the detour. One line of work argues transformers don't really *store* knowledge in retrievable slots at all: the residual stream transmits knowledge as continuous flow, more like an oral performance than a written archive, which is precisely why model knowledge is contextual and hard to edit [Do transformer models store knowledge or generate it continuously?]. If that's right, then "copying accuracy vs. filtering" isn't an SSM-specific defect; it's a sharper version of a tension every neural sequence model lives inside. The fixed recurrent state just makes the cost legible by putting a hard wall on how much can flow through.

The takeaway a curious reader might not have expected: copying and filtering aren't two tasks a model happens to be good or bad at — they're the two ends of a single dial set by how much state you allow. Give the model unbounded retrievable context (attention) and it copies perfectly but pays quadratically; squeeze it into a fixed vector and it filters elegantly but cannot quote you back. The frontier work isn't choosing a point on that dial — it's building two memories so you don't have to.
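The two ends of that dial can be put side by side with a rough cost count. This is a back-of-the-envelope sketch under simplifying assumptions: causal attention is charged one unit of work per query-key pair, a recurrent model is charged one unit per state update, and `attention_cost`, `fixed_state_cost`, and `state_size` are illustrative names rather than anything from the cited work.

```python
def attention_cost(n_tokens: int) -> dict:
    """Causal self-attention: token t attends to tokens 1..t."""
    return {
        "stored_items": n_tokens,                      # one key/value pair per token
        "work_units": n_tokens * (n_tokens + 1) // 2,  # 1 + 2 + ... + n
    }


def fixed_state_cost(n_tokens: int, state_size: int) -> dict:
    """Recurrent model with a fixed-size state (assumed constant per-step cost)."""
    return {
        "stored_items": state_size,  # does not grow with sequence length
        "work_units": n_tokens,      # one state update per token
    }


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, attention_cost(n), fixed_state_cost(n, state_size=4096))
```

Storage that grows with the input is what makes exact quoting possible, and the quadratic work is its price. A constant `state_size` keeps both costs flat, and that flat cost is what forces the model to filter.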

Sources (6 notes)

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can recurrent hierarchies achieve reasoning that transformers cannot?

The Hierarchical Reasoning Model couples slow abstract planning with fast detailed computation across two timescales, achieving near-perfect performance on Sudoku and mazes where chain-of-thought methods fail completely. With only 27M parameters and 1,000 samples, HRM escapes the AC0/TC0 complexity ceiling that constrains fixed-depth transformers.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Topological Trouble With Transformers (2.48 match)
Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs (1.79 match)
Repeat After Me: Transformers are Better than State Space Models at Copying (1.77 match)
Titans: Learning to Memorize at Test Time (1.73 match)
Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers (1.69 match)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (1.68 match)
A Mechanistic Analysis of Looped Reasoning Language Models (1.67 match)
Differential Transformer (1.61 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about the copying–filtering trade-off in sequence models with fixed recurrent states. The question remains: do bounded latent vectors inevitably sacrifice copying fidelity for selective filtering, or have recent advances (model scaling, training methods, architectural hybrids, or new evaluation harnesses) relaxed this constraint?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable.

• Two-layer transformers provably copy exponentially long sequences; SSMs with fixed state cannot — state bottleneck fundamentally prevents arbitrary token retrieval (2024-02, arXiv:2402.01032).
• Hidden states sparsify adaptively under OOD shift, suggesting learned selective filtering rather than uniform compression failure; representational density correlates with training data familiarity (2026-03, arXiv:2603.03415).
• Titans hybrid: neural memory module + attention decouples short-range exact copying (attention) from long-range lossy filtering (memory), scaling past 2M tokens without quadratic cost (2024-12, arXiv:2501.00663).
• Hierarchical recurrence couples slow planning with fast computation, achieving effective depth that fixed-depth transformers cannot; suggests depth-wise division of labor (2025-06, arXiv:2506.21734).
• Transformers transmit knowledge as continuous flow in residual streams, not discrete retrieval slots — contextual and hard to edit, reframing the copying problem as universal to neural sequence models, not SSM-specific (2024-04, arXiv:2402.01032).

Anchor papers (verify; mind their dates):
• arXiv:2402.01032 (2024-02): transformers vs. SSMs on copying.
• arXiv:2501.00663 (2024-12): Titans memory module.
• arXiv:2506.21734 (2025-06): hierarchical recurrence.
• arXiv:2603.03415 (2026-03): OOD sparsification mechanisms.

Your task:
(1) RE-TEST EACH CONSTRAINT. Does the 2024 proof that SSMs cannot copy exponentially long sequences still hold under: (a) newer SSM parameterizations or hybrid architectures post-2024? (b) scaling laws or training innovations that reshape the state-filtering learned during pretraining? (c) test-time compute / in-context learning that might retrofit unbounded context into a fixed state? Separate the durable question (is there a fundamental information-theoretic limit?) from the perishable limitation (do current SSMs hit it in practice?).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown SSMs *matching* or *exceeding* transformer copying on realistic tasks, or shown hybrid architectures achieving both without hidden cost?

(3) Propose 2 research questions that ASSUME the regime may have moved: (i) if selective filtering *is* learned and adaptive, does it emerge earlier in training, and can you steer it post-hoc? (ii) can test-time compute (hierarchical reasoning, chain-of-thought, latent search) simulate unbounded state in SSM-like models without breaking the efficiency promise?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
