Why is long-context compute spent transforming context into internal state rather than storing it?
This explores why long-context models burn compute reshaping incoming text into the model's working representation (its internal state / weights / cache) instead of just parking the raw text in memory and reading it back later.
The short version from the corpus: storage was never the real constraint. The bottleneck is the work of turning raw text into something the model can actually reason over. One line of research reframes the whole problem this way: the limit isn't memory capacity but the compute needed to consolidate evicted context into fast weights during an offline "sleep" phase. Crucially, performance keeps improving as you run more consolidation passes, a test-time scaling pattern observed on harder reasoning tasks (note: Is long-context bottleneck really about memory or compute?). Storage is cheap; understanding is the expense.
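To make "more consolidation passes, better recall" concrete, here is a minimal sketch of writing evicted key/value pairs into a fast-weight matrix by repeated gradient sweeps. Everything in it is an assumption for illustration: the linear associative memory, the names `consolidate` and `recall_error`, the learning rate, and the hand-picked orthonormal keys (chosen so the arithmetic is easy to follow). It is not the method from the cited note.

```python
def consolidate(pairs, dim, passes, lr=0.5):
    """Write (key, value) pairs into a dim x dim fast-weight matrix.

    Each pass is one sweep of per-pair gradient steps on squared recall error.
    """
    W = [[0.0] * dim for _ in range(dim)]
    for _ in range(passes):
        for key, value in pairs:
            # Current recall for this key: pred = W @ key
            pred = [sum(W[i][j] * key[j] for j in range(dim)) for i in range(dim)]
            for i in range(dim):
                err = pred[i] - value[i]
                for j in range(dim):
                    W[i][j] -= lr * err * key[j]
    return W


def recall_error(W, pairs):
    """Mean squared error between recalled and stored values."""
    total = 0.0
    for key, value in pairs:
        pred = [sum(w * k for w, k in zip(row, key)) for row in W]
        total += sum((p - v) ** 2 for p, v in zip(pred, value))
    return total / len(pairs)


# Hypothetical "evicted context": three unit-norm, mutually orthogonal keys.
# With lr=0.5, each pass halves the remaining recall residual for every pair.
pairs = [
    ([0.5, 0.5, 0.5, 0.5], [1.0, -1.0, 0.5, 0.0]),
    ([0.5, -0.5, 0.5, -0.5], [0.0, 2.0, 0.0, 1.0]),
    ([0.5, 0.5, -0.5, -0.5], [3.0, 0.0, -1.0, 0.0]),
]

errors = {n: recall_error(consolidate(pairs, 4, n), pairs) for n in (1, 4, 16)}
for n, e in errors.items():
    print(f"{n:>2} passes -> recall error {e:.2e}")
```

The stored text never changes between runs; only the amount of compute spent consolidating it does, and recall error falls accordingly. That is the sense in which the constraint is compute rather than capacity.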
Why can't you just keep the raw tokens around? Because raw context degrades when you try to attend over all of it at once. Fewer than 5% of attention heads actually do the retrieval work. They are a sparse, intrinsic mechanism that has to activate dynamically on the right spans; prune them and the model hallucinates even though the information is sitting right there in the prompt (note: What mechanism enables models to retrieve from long context?). So having the text in memory guarantees nothing. The state, meaning the transformed, consolidated version, is what makes the information usable.
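One way to picture what "doing the retrieval work" means is a copy-based score per head: how often the head's most-attended context position sits inside the needle and holds exactly the token being generated. The sketch below assumes that scoring rule; the toy attention rows, the head names, and the 0.5 cutoff are all hypothetical and not taken from the cited note.

```python
def top_position(attention_row):
    """Index of the context position receiving the most attention."""
    return max(range(len(attention_row)), key=attention_row.__getitem__)


def retrieval_score(attention_rows, context, generated, needle_span):
    """Fraction of needle positions this head copies from during decoding.

    A decoding step counts as a copy when the head's top-attended position
    lies inside the needle span and holds the token being generated.
    """
    start, end = needle_span  # half-open interval [start, end)
    copied = set()
    for row, token in zip(attention_rows, generated):
        pos = top_position(row)
        if start <= pos < end and context[pos] == token:
            copied.add(pos)
    return len(copied) / (end - start)


def peaked(position, length):
    """Toy attention row with most of its mass on a single position."""
    return [0.91 if i == position else 0.01 for i in range(length)]


context = ["the", "sky", "is", "gray", ".", "code", "is", "7319", ".", "ok"]
needle_span = (5, 9)                 # "code is 7319 ."
generated = ["code", "is", "7319"]   # tokens the model emits while answering

# One row of attention per generated token, for two hypothetical heads.
heads = {
    "head_a": [peaked(5, 10), peaked(6, 10), peaked(7, 10)],  # tracks the needle
    "head_b": [peaked(0, 10), peaked(2, 10), peaked(9, 10)],  # looks elsewhere
}

scores = {name: retrieval_score(rows, context, generated, needle_span)
          for name, rows in heads.items()}
retrieval_heads = [name for name, s in scores.items() if s >= 0.5]  # hypothetical cutoff
print(scores, retrieval_heads)
```

Note that `head_b` attends to a matching "is" token outside the needle and still scores zero: the text being present, and even being looked at, is not the same as retrieving it.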
The corpus is full of strategies that, read together, all amount to "pay compute to convert, don't just hoard." ReadAgent compresses documents into gist memories *before it even knows the task*, then fetches details only when needed, stretching effective context 3–20× (note: Can LLMs read long documents like humans do?). Looped world models refine a latent state through iterative depth instead of adding parameters, spending more computation on harder steps, for up to 100× parameter efficiency (note: Can looped computation replace parameter count in world models?). Recursive subtask trees with rule-based KV-cache pruning sustain reasoning past the context limit even when the pruning manipulates 90% of the cache (note: Can recursive subtask trees overcome context window limits?). In every case the move is the same: transform aggressively, keep little.
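The gist-then-look-up pattern described for ReadAgent can be sketched in a few lines. This is a toy under stated assumptions, not ReadAgent's implementation: `make_gist` stands in for an LLM summarizer by keeping only the first sentence, and page selection uses plain word overlap instead of a model's judgment. All function names are hypothetical.

```python
import re


def words(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def make_gist(page):
    # Stand-in for an LLM summarizer: keep only the first sentence.
    return re.split(r"(?<=[.!?])\s+", page.strip())[0]


def build_memory(pages):
    """Task-agnostic step: compress every page before any question arrives."""
    return [make_gist(page) for page in pages]


def pages_to_expand(question, gists, k=1):
    """Pick the k pages whose gists share the most words with the question."""
    q = words(question)
    ranked = sorted(range(len(gists)),
                    key=lambda i: len(q & words(gists[i])),
                    reverse=True)
    return sorted(ranked[:k])


def working_context(question, pages, gists, k=1):
    """Gists everywhere, full text only for the pages worth re-reading."""
    chosen = set(pages_to_expand(question, gists, k))
    return [pages[i] if i in chosen else gists[i] for i in range(len(pages))]


pages = [
    "The pump controller runs on firmware 2.1. It reboots nightly at 03:00 and logs to flash.",
    "The billing service batches invoices hourly. Failed batches retry three times before alerting.",
    "The sensor array reports temperature every minute. Readings above 90 C trigger a shutdown.",
]
gists = build_memory(pages)
context = working_context("How many times do failed billing batches retry?", pages, gists)
for line in context:
    print(line)
```

The compute is spent up front, in `build_memory`, before the question exists. At question time only one page is expanded back to full fidelity, which is the "transform aggressively, keep little" move in miniature.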
There's a fascinating inversion here worth lingering on. If transformed state is the valuable thing, why does so much work go the *other* direction, pushing context back out into external stores? Recursive Language Models park the whole prompt in a Python REPL and query it as an external environment, handling inputs 100× past the context window (note: Can models treat long prompts as external code environments?). MRAgent moves relational reasoning out of storage and into retrieval, reconstructing memory by traversing a graph on demand rather than fetching pre-stored answers (note: Can agents reconstruct memory on demand instead of retrieving it?). The reconciliation: these systems externalize the *raw* material precisely so they can spend their scarce attention-compute transforming only the slice that matters, when it matters. They're not storing instead of transforming; they're deferring the transform.
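The "prompt as external environment" idea can be illustrated with a small stand-in. This is a sketch only: `PromptEnvironment`, its `grep` method, and `answer_from_snippets` are hypothetical names, and the hard-coded regex replaces the code that a real model would write and the sub-call it would make.

```python
import re


class PromptEnvironment:
    """Holds a long prompt outside the model's context and exposes cheap queries."""

    def __init__(self, prompt):
        self._prompt = prompt
        self.chars_read = 0  # raw text actually pulled into the working context

    def length(self):
        return len(self._prompt)

    def grep(self, pattern, window=40):
        """Return a short snippet around each regex match."""
        snippets = []
        for match in re.finditer(pattern, self._prompt):
            start = max(0, match.start() - window)
            end = min(len(self._prompt), match.end() + window)
            snippets.append(self._prompt[start:end])
            self.chars_read += end - start
        return snippets


def answer_from_snippets(snippets):
    # Stand-in for a recursive sub-model call over the retrieved slice.
    for snippet in snippets:
        found = re.search(r"access code is (\d+)", snippet)
        if found:
            return found.group(1)
    return None


filler = "Nothing notable happened on this day. " * 2000
prompt = filler + "The access code is 7319. " + filler

env = PromptEnvironment(prompt)
answer = answer_from_snippets(env.grep(r"access code"))
print(answer, f"read {env.chars_read} of {env.length()} characters")
```

The full prompt is stored, but almost none of it is transformed: only the slice returned by `grep` ever reaches the (stubbed) reasoning step. The transform is deferred until a query says where to spend it.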
The part you might not have known to ask about: this reframes long context as a compute-allocation problem, not a capacity problem, and that opens a knob. If the bottleneck is transformation work, you can adaptively *meter* it. An external RL-trained manager can compress hard for weak agents and preserve high fidelity for strong ones, matching transformation effort to who's consuming it (note: Can external managers compress context better than frozen agents?), while playbook-style approaches apply incremental, structured updates to avoid the detail erosion that naive compression causes (note: Can context playbooks prevent knowledge loss during iteration?). Storing context is a solved, boring problem. Deciding how much compute to spend turning it into usable state, and for whom, is where the field is actually moving.
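Metering can be pictured as a budget that depends on the consumer. The sketch below swaps the RL-trained manager for a fixed linear rule, purely for illustration: `keep_ratio`, its 0.2 to 0.9 range, the agent-strength scale, and the per-chunk relevance scores are all assumptions, not values from the cited notes.

```python
def keep_ratio(agent_strength):
    """Map agent strength in [0, 1] to the fraction of context kept.

    Hypothetical linear rule standing in for a learned policy:
    weak agents get about 20% of the chunks, strong agents about 90%.
    """
    s = min(1.0, max(0.0, agent_strength))
    return 0.2 + 0.7 * s


def compress(chunks, agent_strength):
    """Keep the highest-scoring chunks, preserving their original order."""
    n_keep = max(1, round(len(chunks) * keep_ratio(agent_strength)))
    top = sorted(range(len(chunks)),
                 key=lambda i: chunks[i][1],
                 reverse=True)[:n_keep]
    return [chunks[i][0] for i in sorted(top)]


# (text, relevance score) pairs; the scores are made up for the example.
chunks = [
    ("task statement", 0.95),
    ("tool log 1", 0.10),
    ("error trace", 0.80),
    ("tool log 2", 0.15),
    ("user constraint", 0.70),
    ("tool log 3", 0.20),
    ("intermediate result", 0.60),
    ("tool log 4", 0.05),
    ("retry note", 0.40),
    ("final plan", 0.50),
]

weak_view = compress(chunks, agent_strength=0.0)
strong_view = compress(chunks, agent_strength=1.0)
print(weak_view)
print(len(strong_view), "chunks kept for the strong agent")
```

The same stored context yields two different working views. The design question is no longer where to put the chunks but what policy sets `keep_ratio`, which is the part a trained manager would learn.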
Sources (9 notes)
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Fewer than 5% of attention heads across all model families function as retrieval heads. These heads are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination despite the information being present in context.
ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.
LoopWM achieves up to 100× parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Recursive Language Models store long prompts in a Python REPL and query them via code execution, avoiding attention degradation. RLMs outperform base models even on shorter prompts while handling inputs two orders of magnitude beyond context windows.
MRAgent achieves up to 23% gains on reasoning tasks by reconstructing memory through active graph traversal that prunes paths based on accumulated evidence, while reducing token and runtime cost compared to fixed-retrieval pipelines.
An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.