INQUIRING LINE

How should memory systems split between short-term and long-term storage?

This explores whether the classic short-term vs. long-term split is even the right way to organize memory — and what the corpus offers as alternatives, from consolidation timing to connectivity to dual management paths.


The most interesting answer in the corpus is that the split itself may be the wrong frame. A 2025 survey argues short-term and long-term aren't architectural categories at all but *emergent temporal patterns*, and proposes organizing memory instead along three orthogonal axes: forms (token, parametric, latent), functions (factual, experiential, working), and dynamics (formation, evolution, retrieval) [Can three axes replace the short-term long-term memory split?]. If you take that seriously, the question shifts from "where do I draw the line between fast and slow?" to "what is each memory *for*, and how does it move between forms over time?"
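To make the three-axis framing concrete, here is a minimal sketch, not the survey's own formalism. It tags each memory with a form and a function, and treats "short-term" as something computed from usage timestamps rather than stored as a category. The `MemoryItem` fields, the `looks_short_term` helper, and the 60-second window are all illustrative assumptions; the dynamics axis is represented only by the code that acts on items.

```python
from dataclasses import dataclass
from enum import Enum


class Form(Enum):
    TOKEN = "token"
    PARAMETRIC = "parametric"
    LATENT = "latent"


class Function(Enum):
    FACTUAL = "factual"
    EXPERIENTIAL = "experiential"
    WORKING = "working"


@dataclass
class MemoryItem:
    content: str
    form: Form
    function: Function
    created_at: float    # hypothetical timestamp, in seconds
    last_used_at: float  # hypothetical timestamp, in seconds


def looks_short_term(item: MemoryItem, now: float, window: float = 60.0) -> bool:
    """Derive 'short-term' from recency of use; it is never stored on the item."""
    return now - item.last_used_at <= window


items = [
    MemoryItem("user prefers metric units", Form.TOKEN, Function.FACTUAL, 0.0, 990.0),
    MemoryItem("scratchpad: step 3 of plan", Form.TOKEN, Function.WORKING, 950.0, 995.0),
    MemoryItem("last deploy failed on lint", Form.TOKEN, Function.EXPERIENTIAL, 10.0, 20.0),
]

# An old factual memory can still behave as "short-term" if it was used recently.
recent = [item.content for item in items if looks_short_term(item, now=1000.0)]
```

In this toy setup the oldest item (created at t=0) still counts as recent because it was just used, which is the sense in which the temporal label emerges from behavior rather than from where the item lives.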

That said, practical systems still draw a line; they just draw it in different places. Titans makes it architectural: attention handles short-term context (quadratic, precise, recent) while a separate neural memory module compresses long-term information, prioritizing *surprising* tokens for storage so it scales past 2M tokens without the quadratic penalty [Can neural memory modules scale language models beyond attention limits?]. RAISE draws the line by granularity rather than duration, separating dialogue-level memory (conversation history, scratchpad) from turn-level memory (examples, task trajectory), and shows each granularity has its own failure modes and update rules [How should agent memory split across time scales?]. So even when people split memory, the *useful* split is rarely just "recent vs. old."
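The idea of prioritizing surprising inputs for storage can be illustrated with a toy gate. This is not the Titans mechanism (which operates inside a neural memory module); it is a hedged stand-in that uses a running prediction over a numeric stream and stores a value only when the prediction error exceeds a threshold. The function name, `threshold`, and `lr` are assumptions chosen for illustration.

```python
def surprise_gated_store(stream, threshold=2.0, lr=0.5):
    """Keep only the values that a running prediction failed to anticipate."""
    prediction = stream[0]
    stored = []
    for t, x in enumerate(stream[1:], start=1):
        surprise = abs(x - prediction)       # toy surprise: absolute prediction error
        if surprise > threshold:
            stored.append((t, x))            # write to the long-term store
        prediction += lr * (x - prediction)  # move the prediction toward the input
    return stored


stream = [1.0, 1.0, 1.0, 9.0, 9.0, 9.0, 9.0, 1.0]
kept = surprise_gated_store(stream)
```

Only the regime changes are written: the jump to 9.0 (while the prediction is still catching up) and the drop back to 1.0. The steady stretches, which the predictor already anticipates, cost no storage.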

The deepest reframe is that the hard problem isn't storage capacity, it's *consolidation*: moving information from the fast, expensive working buffer into compressed, durable form. One line of work finds the long-context bottleneck is actually compute, not memory: performance improves the more passes you spend transforming evicted context into internal state during offline "sleep" phases, following a test-time-scaling curve [Is long-context bottleneck really about memory or compute?]. The Sleep paradigm operationalizes this with knowledge distillation and RL-generated rehearsal ("dreaming") to fold in-context knowledge into weights without catastrophic forgetting [Can models consolidate memories during offline sleep phases?]. Agent systems do the same at the application layer: DeepAgent autonomously folds interaction history into episodic, working, and tool memory schemas [Can agents compress their own memory without losing critical details?], while ReadAgent compresses documents into "gist memories" up front and fetches details only when a task demands them [Can LLMs read long documents like humans do?]. In all of these, the short/long boundary is really a *compression gradient*, not a wall.
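The gist-then-lookup pattern can be sketched in a few lines. This is a simplified illustration, not ReadAgent's implementation: a real system would use an LLM to write the gists and to choose which pages to re-read, whereas here the gist is just a page's first sentence and selection is plain word overlap. `build_gists`, `lookup`, and the sample pages are all hypothetical.

```python
def build_gists(pages):
    """Compress each page before any task is known (stand-in: first sentence)."""
    return [page.split(".")[0].strip() + "." for page in pages]


def lookup(question, pages, gists, max_pages=1):
    """Pick pages by gist overlap with the question, then return their full text."""
    q_words = set(question.lower().replace("?", "").split())

    def score(i):
        return len(q_words & set(gists[i].lower().rstrip(".").split()))

    ranked = sorted(range(len(pages)), key=score, reverse=True)
    return [pages[i] for i in ranked[:max_pages] if score(i) > 0]


pages = [
    "The cache layer uses Redis. Eviction is LRU with a 10 minute TTL.",
    "The billing service runs nightly. Invoices are emailed at 02:00 UTC.",
]
gists = build_gists(pages)  # cheap, always held in context
hits = lookup("How does the cache layer evict entries?", pages, gists)
```

The gist never mentions eviction, but it is enough to route the question to the right page, and the detail is recovered from the full text only at that point. That is the compression gradient in miniature: the compact form stays resident, the expensive form is fetched on demand.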

Two findings should change how you think about the long-term tier specifically. First, what you store matters less than whether it's reachable: FluxMem shows memory usefulness is a *connectivity* problem. Links between co-activated units form the subgraph you can actually traverse at decision time, and inert storage with bad topology is worthless [Is agent memory a storage problem or a connectivity problem?]. Second, how you *index* long-term memory determines whether it transfers: for web agents, PRAXIS shows that procedures indexed by concrete environment state beat tidy high-level workflow abstractions, because the abstractions lose the click-by-click specifics that make a memory actionable [Does state-indexed memory outperform high-level workflow memory for web agents?].
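The connectivity point is easy to see in a toy graph. The sketch below is an illustration of the idea rather than FluxMem's algorithm: items that are used together get linked, and retrieval is a bounded breadth-first walk from a cue. The `LinkedMemory` class, its method names, and the two-hop limit are assumptions.

```python
from collections import defaultdict, deque
from itertools import combinations


class LinkedMemory:
    def __init__(self):
        self.items = set()
        self.links = defaultdict(set)

    def store(self, item):
        """Storing alone creates no links, so the item starts out unreachable."""
        self.items.add(item)

    def co_activate(self, *items):
        """Link every pair of items that were used together."""
        for item in items:
            self.store(item)
        for a, b in combinations(items, 2):
            self.links[a].add(b)
            self.links[b].add(a)

    def reachable(self, cue, max_hops=2):
        """Return items within max_hops links of the cue (breadth-first)."""
        seen = {cue}
        frontier = deque([(cue, 0)])
        while frontier:
            node, hops = frontier.popleft()
            if hops == max_hops:
                continue
            for nxt in self.links.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, hops + 1))
        return seen - {cue}


mem = LinkedMemory()
mem.co_activate("login page", "2FA prompt")
mem.co_activate("2FA prompt", "backup code")
mem.store("api key rotation")  # stored, but never linked to anything
found = mem.reachable("login page")
```

Here "api key rotation" is in the store yet can never be returned from the "login page" cue, which is what inert storage with bad topology looks like in practice.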

So the honest answer is: don't start by splitting on time. Decide *who manages each memory* first: the corpus separates an explicit "hot path", where the agent decides what to keep via tool calls, from an implicit background path that's triggered programmatically [How should agents decide what memories to keep?]. Then decide each memory's function, how it consolidates, and how it's indexed and linked for retrieval. Short-term vs. long-term falls out of those choices rather than driving them.
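The two management paths can be sketched side by side. This is a minimal illustration under stated assumptions, not any particular framework's API: `save_memory_tool` stands in for a tool the agent calls deliberately, and `on_turn` stands in for a hook the application fires after every turn, consolidating every `background_every` turns. The joined-string "summary" is a placeholder for whatever consolidation a real system would run.

```python
class MemoryStore:
    def __init__(self, background_every=3):
        self.notes = []   # durable memories, tagged with the path that wrote them
        self.turns = []   # raw turn log
        self.background_every = background_every

    def save_memory_tool(self, text):
        """Hot path: the agent chooses to call this as a tool."""
        self.notes.append(("hot", text))
        return "saved"

    def on_turn(self, turn_text):
        """Background path: fired by the application, not by the agent."""
        self.turns.append(turn_text)
        if len(self.turns) % self.background_every == 0:
            window = self.turns[-self.background_every:]
            # Placeholder consolidation: join the last few turns.
            self.notes.append(("background", " | ".join(window)))


store = MemoryStore(background_every=3)
store.on_turn("user asks about refunds")
store.save_memory_tool("user is on the annual plan")  # agent's explicit decision
store.on_turn("agent explains policy")
store.on_turn("user confirms")                        # third turn triggers consolidation
```

The hot path captures what the agent judged important in the moment; the background path runs whether or not the agent thought to save anything. Neither write depends on how old the information is.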


Sources (10 notes)

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Can neural memory modules scale language models beyond attention limits?

Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can models consolidate memories during offline sleep phases?

The Sleep paradigm uses Knowledge Seeding (distilling smaller networks into larger ones) and Dreaming (RL-generated rehearsal) to consolidate in-context knowledge into weights without forgetting. Gains appear in long-context understanding, few-shot reasoning, and continual learning.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can LLMs read long documents like humans do?

ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.

Is agent memory a storage problem or a connectivity problem?

FluxMem shows that memory usefulness is determined by links between co-activated units forming an accessible subgraph, not by what is stored. Storage is necessary but inert; topology determines whether useful memories are reachable at decision time.

Does state-indexed memory outperform high-level workflow memory for web agents?

PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.

How should agents decide what memories to keep?

Memory management decomposes into explicit hot-path (agent decides via tool calling) and implicit background (programmatically triggered) paths. Each approach trades context-sensitivity for reliability differently across generation, storage, retrieval, and deletion.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question is: How should memory systems in LLMs and agents split between short-term and long-term storage — and is that split even the right frame?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat these as perishable constraints to re-test:
• The short/long split may be a false dichotomy: temporal patterns are emergent; memory should instead be organized along forms (token, parametric, latent), functions (factual, experiential, working), and dynamics (formation, evolution, retrieval) (2025–2026).
• Practical splits work when drawn by *function* and *granularity* (dialogue vs. turn level), not duration; each has distinct failure modes and update rules (2025).
• The bottleneck is *consolidation*, not capacity: performance improves with the compute spent transforming evicted context into internal state during "sleep" phases, following a test-time-scaling curve; rehearsal and distillation fold in-context knowledge into weights (2026).
• Memory usefulness is a *connectivity* problem: inert storage with poor link topology is worthless; reachability matters more than what is stored (2026).
• Long-term memory transfer depends on *indexing by concrete state*, not abstract workflow; state-dependent procedures beat tidy abstractions (2025).

Anchor papers (verify; mind their dates):
• arXiv:2512.13564 — Memory in the Age of AI Agents: A Survey (2025–12)
• arXiv:2501.00663 — Titans: Learning to Memorize at Test Time (2024–12)
• arXiv:2606.03979 — Language Models Need Sleep (2026–06)
• arXiv:2605.28773 — Rethinking Memory as Continuously Evolving Connectivity (2026–05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, scaling methods, or agent orchestration frameworks have since relaxed or overturned it. Separate the durable question (still open: how do you *consolidate* without forgetting?) from the perishable limitation (may be solved: whether emergent temporal patterns are truly unconstrained by architecture). Cite what resolved it; say plainly where constraints still hold.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — any paper that recovers the short/long split as necessary, or proposes a wholly different organizing principle.
(3) Propose 2 research questions that ASSUME the regime has moved: (a) If indexing and connectivity trump temporal position, what *minimal* metadata suffices for perfect retrieval? (b) Does consolidation cost scale better under continual sleep or event-triggered compression?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines