INQUIRING LINE

Can recurrent state mechanisms process longer sequences than attention-based working memory approaches?

This explores whether recurrent mechanisms — models that carry a compressed running state forward — can handle longer inputs than attention, which compares every token to every other, and what each approach trades away.


The corpus says yes: on raw length, recurrence wins decisively. A fine-tuned GPT-2 with recurrent memory augmentation processes up to 11 million tokens by *filtering* irrelevant content rather than attending to all of it, exactly where attention-based models degrade and fixate on the start of the input (Can recurrent memory scale where attention fails on ultra-long text?). The reason attention struggles isn't mysterious: its cost grows with the square of the sequence length, so doubling the input quadruples the work, which is why architectures that scale to millions of tokens lean on a compressed state instead.
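To make the quadratic claim concrete, here is a minimal sketch under a deliberately simplified cost model: full self-attention is charged one unit per token pair, and a recurrent scan is charged one state update per token. The function names and the unit-cost assumption are illustrative and do not come from any of the cited papers; real costs also depend on model width, number of heads, and hardware.

```python
def attention_pair_count(n_tokens: int) -> int:
    """Token-pair comparisons in full self-attention (simplified to n * n)."""
    return n_tokens * n_tokens


def recurrent_update_count(n_tokens: int) -> int:
    """State updates in a recurrent scan (simplified to one per token)."""
    return n_tokens


# Doubling the input quadruples the attention count but only doubles
# the recurrent count.
for n in (1_000, 2_000, 4_000):
    print(n, attention_pair_count(n), recurrent_update_count(n))
```

Under this toy model the gap between the two counts is itself a factor of n, which is why it widens so quickly at million-token lengths.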

But length isn't the whole story, and here the corpus pushes back in a way you might not expect. There's a provable limit baked into recurrence: because a recurrent model squeezes everything into a *fixed-size* state, it can't faithfully copy or retrieve long spans the way attention can. Two-layer transformers can copy exponentially long strings; state-space models hit a wall set by how much their latent state can hold (Can state-space models match transformers at copying and retrieval?). So the honest answer is a split decision: recurrence reads further, attention remembers verbatim better. Length and fidelity are different axes.
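The fixed-state wall is, at bottom, a counting argument, sketched below under simplifying assumptions: a state that holds b bits can take at most 2^b distinct values, so it cannot tell apart (and therefore cannot exactly copy) every string of length n over a vocabulary of size V once V^n exceeds 2^b. The function name and the example numbers are hypothetical, and the formal statement in the cited work may differ in its details.

```python
import math


def max_copyable_length(state_bits: int, vocab_size: int) -> int:
    """Largest n with vocab_size**n <= 2**state_bits (pigeonhole bound)."""
    if state_bits < 0 or vocab_size < 2:
        raise ValueError("need state_bits >= 0 and vocab_size >= 2")
    # Floating-point estimate, then exact integer correction.
    n = int(state_bits / math.log2(vocab_size))
    while vocab_size ** (n + 1) <= 2 ** state_bits:
        n += 1
    while n > 0 and vocab_size ** n > 2 ** state_bits:
        n -= 1
    return n


# A 1024-bit state and a 256-symbol vocabulary (8 bits per symbol).
print(max_copyable_length(1024, 256))
```

The bound says nothing about how well a trained model uses its state; it only caps what any fixed-size state could store exactly.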

The most interesting moves in the corpus refuse the either/or. Titans bolts a neural long-term memory onto attention, keeping attention for short-range work while a separate module adaptively memorizes *surprising* tokens, scaling past 2M tokens of context without the quadratic penalty (Can neural memory modules scale language models beyond attention limits?). TransformerFAM goes further with no new weights at all: it loops the transformer's attention back onto its own latent representations, growing an emergent working memory for indefinitely long inputs (Can models learn working memory by attending to their own latents?). And ReadAgent sidesteps the architecture question entirely by acting human: it compresses documents into 'gist memories' up front and fetches details only when a task demands them, stretching effective context 3–20× (Can LLMs read long documents like humans do?).
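The gist-then-lookup pattern described for ReadAgent can be illustrated with a toy sketch. Everything here is an assumption made for illustration: truncating each page to its first few words stands in for an LLM-written gist, word overlap stands in for the model deciding which pages to re-read, and the names make_gists, lookup_pages, and build_context are hypothetical rather than ReadAgent's actual interface.

```python
import re


def _words(text: str) -> set:
    """Lowercased word set, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))


def make_gists(pages, max_words=5):
    """Toy stand-in for summarization: keep the first few words of each page."""
    return [" ".join(page.split()[:max_words]) for page in pages]


def lookup_pages(question, gists, top_k=1):
    """Indices of the gists sharing the most words with the question."""
    q = _words(question)
    ranked = sorted(range(len(gists)), key=lambda i: -len(q & _words(gists[i])))
    return ranked[:top_k]


def build_context(question, pages, gists, top_k=1):
    """Full text for looked-up pages, gist only for everything else."""
    chosen = set(lookup_pages(question, gists, top_k))
    return [pages[i] if i in chosen else gists[i] for i in range(len(pages))]


pages = [
    "Titanium alloys resist corrosion in seawater, so shipbuilders use them for hull fittings.",
    "Copper conducts electricity well, which is why household wiring relies on it.",
]
gists = make_gists(pages)  # built before any question is known
print(build_context("Where are titanium alloys used?", pages, gists))
```

The design point the sketch preserves is the ordering: compression happens before the task is known, and full detail is paid for only on the pages a question actually needs.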

Here's the thing you didn't know you wanted to know: a recent line of work argues the real bottleneck was never memory *capacity* at all. It's the *compute* needed to fold evicted context into the model's internal state (Is long-context bottleneck really about memory or compute?). Recurrence may be valuable less for storage and more as a *consolidation* tool: running extra passes with no new input to transfer recent context into persistent fast weights, mirroring how the hippocampus replays memories during sleep (Can recurrence consolidate memory without predicting tokens?). That reframes the whole question: it's not 'which mechanism holds more,' but 'which mechanism spends compute to turn experience into durable state.'

One sobering footnote keeps both camps humble. Reasoning accuracy collapses long before any architecture runs out of room, dropping from 92% to 68% with just 3,000 tokens of padding, far under the context limit (Does reasoning ability actually degrade with longer inputs?). Processing a longer sequence and *reasoning well over* a longer sequence are not the same achievement, and so far no mechanism here fully closes that gap.


Sources (8 notes)

Can recurrent memory scale where attention fails on ultra-long text?

Fine-tuned GPT-2 with recurrent memory augmentation processes up to 11 million tokens and enables multi-hop reasoning by selectively filtering irrelevant content, where attention-based models degrade and concentrate on early input.

Can state-space models match transformers at copying and retrieval?

Two-layer transformers can copy exponentially long strings while state-space models are fundamentally limited by their fixed-size latent state. Empirically, transformers dramatically outperform SSMs at copying and context retrieval in both synthetic and pretrained settings.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

Can models learn working memory by attending to their own latents?

TransformerFAM demonstrates that adding a feedback loop lets transformers attend to their own latent representations, fostering emergent working memory for indefinitely long inputs. The approach requires no additional weights and improves long-context performance at 1B, 8B, and 24B scales.

Can LLMs read long documents like humans do?

ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can recurrence consolidate memory without predicting tokens?

Language models can use recurrent passes without input tokens to transfer recent context into persistent fast weights via learned local rules, mirroring hippocampal replay during biological sleep. This separates consolidation from prediction, enabling different scheduling and compute allocation.

Does reasoning ability actually degrade with longer inputs?

FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst testing whether recurrent state mechanisms truly outperform attention-based working memory on long sequences—a question treated as still open, since the field moves fast and constraints dissolve.

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–May 2026. The library reports:
- Recurrent memory augmentation processes up to 11M tokens by filtering irrelevant content, while attention degrades; attention cost scales quadratically (Feb 2024, arXiv:2402.10790).
- Two-layer transformers can copy exponentially long strings; state-space models hit a fixed-state wall and cannot faithfully retrieve long spans (Feb 2024, arXiv:2402.01032).
- Reasoning accuracy collapses from 92% to 68% with just 3,000 tokens of padding—far before any architecture exhausts context, meaning length ≠ usability (Feb 2024, arXiv:2402.14848).
- Hybrid approaches extend context without the quadratic penalty: Titans adds adaptive neural memory and scales past 2M tokens (Dec 2024, arXiv:2501.00663); TransformerFAM adds feedback attention for indefinitely long inputs (Apr 2024, arXiv:2404.09173).
- The true bottleneck may be compute to consolidate evicted context into persistent state, not storage capacity itself (May 2026, arXiv:2605.26099).

Anchor papers (verify; mind their dates):
- arXiv:2402.10790 (Feb 2024): recurrent memory at 11M tokens
- arXiv:2402.01032 (Feb 2024): transformers beat state-space at copying
- arXiv:2501.00663 (Dec 2024): Titans adaptive memorization
- arXiv:2605.26099 (May 2026): consolidation as compute problem

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding, judge whether newer models (e.g., o1, Gemini 2.0), training methods (streaming, online learning), SDKs (vLLM KV cache compression), or multi-agent orchestration (agent memory pools, persistent context stores) have relaxed or overturned the 11M-token claim, the copying deficit, or the reasoning collapse. Separate the durable question (does recurrence beat attention on *faithful long-range recall*?) from perishable limits (11M tokens as a hard bound; fixed-state architectures cannot retrieve).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown attention-based systems matching or beating recurrence on length *and* fidelity? Has any shown the consolidation hypothesis is wrong?
(3) Propose 2 research questions that assume the regime has moved: e.g., (a) given prefill-compute budgets that rival inference, can consolidated transformer states beat recurrent fidelity? (b) do multi-turn agent loops (where an agent re-reads context across turns) functionally replicate the consolidation benefit without architectural change?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
