How does externalized state affect the long-context bottleneck in language models?
This explores whether moving context *out* of the attention window — into separate memory modules, consolidated weights, or retrieval systems — actually relieves the long-context bottleneck, or just relocates it.
Does externalizing state, that is, parking context somewhere other than the live attention window, fix the long-context problem? The corpus suggests the answer reframes the problem itself. The most striking claim is that the bottleneck isn't storage at all: it's *compute*. One line of work argues that the real cost of long context is the work required to transform evicted context into internal state, consolidating it into fast weights during offline 'sleep' phases, with performance improving as more consolidation passes are spent (see "Is long-context bottleneck really about memory or compute?"). If that's right, then externalizing state doesn't make the bottleneck disappear; it moves it from 'how much can I hold' to 'how much can I afford to digest.'
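To make "more consolidation passes, better recall" concrete, here is a minimal toy sketch. It is not the method from the cited work: it assumes a plain linear fast-weight matrix trained by full-batch gradient descent, and every name in it (`consolidate`, `recall_error`, `evicted`) is hypothetical. It only illustrates that recall of evicted key/value pairs is bought with compute rather than storage.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def consolidate(pairs, dim, passes, lr=0.2):
    """Toy 'sleep' phase: `passes` full-batch gradient steps fitting W @ key ~ value."""
    W = [[0.0] * dim for _ in range(dim)]
    for _ in range(passes):
        # Gradient of 0.5 * sum ||W @ key - value||^2 with respect to W.
        grad = [[0.0] * dim for _ in range(dim)]
        for key, value in pairs:
            pred = matvec(W, key)
            for i in range(dim):
                err = pred[i] - value[i]
                for j in range(dim):
                    grad[i][j] += err * key[j]
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j]
    return W


def recall_error(W, pairs):
    """Mean squared error when reading each value back out of W."""
    total = 0.0
    for key, value in pairs:
        pred = matvec(W, key)
        total += sum((p - v) ** 2 for p, v in zip(pred, value))
    return total / len(pairs)


def unit(vec):
    """Scale a vector to unit length (keeps the step size safely stable)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


rng = random.Random(0)
DIM, N_PAIRS = 8, 4
# Hypothetical "evicted context": random key/value pairs that left the window.
evicted = [
    (unit([rng.gauss(0, 1) for _ in range(DIM)]), [rng.gauss(0, 1) for _ in range(DIM)])
    for _ in range(N_PAIRS)
]

errors = {p: recall_error(consolidate(evicted, DIM, p), evicted) for p in (0, 1, 4, 16)}
for passes, err in errors.items():
    print(f"{passes:>2} passes -> recall error {err:.4f}")
```

The storage (one 8x8 matrix) is identical in every run; only the number of passes changes, and the recall error shrinks with it. That is the sense in which the budget question becomes 'how much can I afford to digest.'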
The architecture work points the same direction. Titans-style designs split the job in two: keep attention for the short-term, quadratic-cost window, and hand long-term retention to a separate neural memory module that compresses and stores only *surprising* tokens, scaling past two million tokens without the quadratic penalty (see "Can neural memory modules scale language models beyond attention limits?"). This is externalized state done well, but notice that it survives by being selective. It isn't holding everything; it's deciding what's worth consolidating, which is the compute-budget problem wearing a different hat.
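A surprise gate of this kind can be sketched in a few lines. The notes here do not spell out how Titans measures surprise, so this sketch swaps in a deliberately simple stand-in: the surprisal (in bits) of a token under a running, add-one-smoothed unigram count. The class name, the `vocab_size` and `threshold_bits` parameters, and the token stream are all assumptions for illustration, not the Titans mechanism.

```python
import math
from collections import Counter


class SurpriseGatedMemory:
    """Toy long-term store that keeps only tokens the running model finds surprising."""

    def __init__(self, vocab_size=50, threshold_bits=4.0):
        self.vocab_size = vocab_size          # assumed vocabulary size for smoothing
        self.threshold_bits = threshold_bits  # assumed cutoff for "surprising"
        self.counts = Counter()
        self.total = 0
        self.stored = []  # (position, token) pairs that passed the gate

    def surprisal_bits(self, token):
        # Add-one smoothed unigram probability, converted to bits of surprise.
        p = (self.counts[token] + 1) / (self.total + self.vocab_size)
        return -math.log2(p)

    def observe(self, position, token):
        # Score the token BEFORE updating counts, so surprise reflects prior expectations.
        surprising = self.surprisal_bits(token) >= self.threshold_bits
        if surprising:
            self.stored.append((position, token))
        self.counts[token] += 1
        self.total += 1
        return surprising


stream = ["the"] * 20 + ["zebra"] + ["the"] * 5
memory = SurpriseGatedMemory()
for pos, tok in enumerate(stream):
    memory.observe(pos, tok)

print(f"stored {len(memory.stored)} of {len(stream)} tokens:", memory.stored)
```

The first few occurrences of "the" are kept because nothing is predictable yet; once the token becomes expected it stops being written, while the lone "zebra" gets through. Memory growth tracks novelty rather than sequence length, which is the selectivity the paragraph above describes.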
Why bother externalizing at all? Because keeping things in-context degrades faster than the window size suggests. On FLenQA, reasoning accuracy falls from 92% to 68% with only about 3,000 tokens of padding, far below the model's nominal capacity; the drop is task-agnostic and is not fixed by chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So the window isn't a clean buffer where more room means more usable memory; it's a place where signal dilutes. That's the case *for* moving state out, but it comes with a catch the corpus is blunt about: models routinely ignore the context you do give them when their trained-in priors are strong, and prompting alone can't override that; it takes intervention in the representations themselves (see "Why do language models ignore information in their context?"). Externalized state is only useful if the model actually *reads* it over its own parametric reflexes.
Retrieval is the most familiar form of externalized state, and here the corpus offers a sharp move: don't retrieve constantly, *learn when to*. DeepRAG frames each reasoning step as a decision, either pull from outside or trust internal knowledge, and gets a ~22% accuracy gain mostly by *not* retrieving when retrieval would only add noise (see "When should language models retrieve external knowledge versus use internal knowledge?"). That closes the loop with the compute-bottleneck framing: whether your external state lives in fast weights, a memory module, or a retrieval index, the win comes from selectivity, not capacity.
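The per-step decision can be sketched as a gate in front of the retriever. DeepRAG learns this decision; the sketch below does not. It assumes a fixed confidence threshold as a stand-in policy, and `Step`, `run_steps`, `fake_retrieve`, and the confidence values are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    question: str
    confidence: float  # assumed self-estimate that internal knowledge suffices


def run_steps(steps, retrieve, threshold=0.7):
    """For each reasoning step, retrieve only when confidence falls below the threshold."""
    trace = []
    for step in steps:
        if step.confidence >= threshold:
            # Trust parametric knowledge; no external text enters the context.
            trace.append((step.question, "internal", None))
        else:
            trace.append((step.question, "retrieve", retrieve(step.question)))
    return trace


calls = []


def fake_retrieve(query):
    """Stand-in retriever that records each call instead of hitting a real index."""
    calls.append(query)
    return f"[document for: {query}]"


steps = [
    Step("Who wrote the cited paper?", 0.35),
    Step("What is 12 * 12?", 0.99),
    Step("What does the acronym RAG stand for?", 0.90),
]
trace = run_steps(steps, fake_retrieve)
for question, source, _evidence in trace:
    print(f"{source:>8} | {question}")
print("retriever calls:", len(calls))
```

Only the low-confidence step touches the external store, so the other two steps never see retrieved text that could dilute them. Replacing the fixed threshold with a learned policy is where an approach like DeepRAG would differ from this sketch.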
The thread worth taking away: the corpus quietly dissolves 'long-context bottleneck' as a memory problem and reassembles it as a *consolidation and selection* problem. There's even a hint that models can internalize this kind of offline processing, using otherwise-wasted sequence space after their output to train self-evaluation at zero inference cost (see "Can models learn to evaluate their own work during training?"), which suggests externalized state and internalized state aren't opposites so much as two ends of the same consolidation pipeline.
Sources (6 notes)
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
The Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3,000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
Post-Completion Learning uses the otherwise unused sequence space after the model's output to train self-assessment, at zero added inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.