How should memory systems split between short-term and long-term storage?
This explores whether the classic short-term vs. long-term split is even the right way to organize memory — and what the corpus offers as alternatives, from consolidation timing to connectivity to dual management paths.
The most interesting answer in the corpus is that the split itself may be the wrong frame. A 2025 survey argues that short-term and long-term are not architectural categories at all but *emergent temporal patterns*, and proposes organizing memory along three orthogonal axes instead: forms (token, parametric, latent), functions (factual, experiential, working), and dynamics (formation, evolution, retrieval) (see "Can three axes replace the short-term long-term memory split?"). If you take that seriously, the question shifts from "where do I draw the line between fast and slow?" to "what is each memory *for*, and how does it move between forms over time?"
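One way to make the three-axis framing concrete is to tag each memory component with a value on each axis. The sketch below is only an illustration: the `MemoryDescriptor` class, the `differs_on` helper, and the two example components are assumptions introduced here, not definitions taken from the survey.

```python
from dataclasses import dataclass
from enum import Enum


class Form(Enum):
    TOKEN = "token"
    PARAMETRIC = "parametric"
    LATENT = "latent"


class Function(Enum):
    FACTUAL = "factual"
    EXPERIENTIAL = "experiential"
    WORKING = "working"


class Dynamics(Enum):
    FORMATION = "formation"
    EVOLUTION = "evolution"
    RETRIEVAL = "retrieval"


@dataclass(frozen=True)
class MemoryDescriptor:
    """Hypothetical tag placing one memory component on the three axes."""
    name: str
    form: Form
    function: Function
    dynamics: frozenset  # the Dynamics stages this component takes part in


def differs_on(a, b):
    """Return the names of the axes on which two components differ."""
    axes = ("form", "function", "dynamics")
    return [axis for axis in axes if getattr(a, axis) != getattr(b, axis)]


# Illustrative components (assumed examples, not from the survey).
chat_buffer = MemoryDescriptor(
    "conversation buffer", Form.TOKEN, Function.WORKING,
    frozenset({Dynamics.FORMATION, Dynamics.RETRIEVAL}),
)
tuned_weights = MemoryDescriptor(
    "fine-tuned weights", Form.PARAMETRIC, Function.FACTUAL,
    frozenset({Dynamics.FORMATION, Dynamics.EVOLUTION}),
)

print(differs_on(chat_buffer, tuned_weights))
```

Note that neither descriptor carries a "short-term" or "long-term" label. Under this framing, duration is something you would observe from how a component is formed and retrieved, not a field you set.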
That said, practical systems still draw a line; they just draw it in different places. Titans makes it architectural: attention handles short-term context (quadratic, precise, recent) while a separate neural memory module compresses long-term information, prioritizing *surprising* tokens for storage so that it scales past 2M tokens without the quadratic penalty (see "Can neural memory modules scale language models beyond attention limits?"). RAISE draws the line by granularity rather than duration, separating dialogue-level memory (conversation history, scratchpad) from turn-level memory (examples, task trajectory), and shows that each granularity has its own failure modes and update rules (see "How should agent memory split across time scales?"). So even when people split memory, the *useful* split is rarely just "recent vs. old."
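The idea of an exact recent window paired with a surprise-gated long-term store can be sketched in a few lines. This is a toy stand-in, not the Titans mechanism: Titans uses a learned neural memory module, whereas the sketch below scores surprise with a running unigram surprisal. The class name, the threshold, and the scoring rule are all assumptions made for illustration.

```python
import math
from collections import Counter, deque


class TwoTierMemory:
    """Toy two-tier memory: exact recent window plus a surprise-gated store."""

    def __init__(self, window_size=4, surprise_threshold=1.0):
        self.window = deque()          # short-term tier: recent (token, score) pairs
        self.window_size = window_size
        self.surprise_threshold = surprise_threshold
        self.counts = Counter()        # running unigram counts used to score surprise
        self.total = 0
        self.long_term = []            # long-term tier: surprising tokens only

    def surprisal(self, token):
        # Add-one smoothed negative log-probability under the running counts
        # (assumed scoring rule; a learned model would replace this).
        vocab = len(self.counts) + 1
        p = (self.counts[token] + 1) / (self.total + vocab)
        return -math.log(p)

    def observe(self, token):
        score = self.surprisal(token)  # score the token before counting it
        self.counts[token] += 1
        self.total += 1
        self.window.append((token, score))
        if len(self.window) > self.window_size:
            old_token, old_score = self.window.popleft()
            # Evicted tokens survive only if they were surprising on arrival.
            if old_score >= self.surprise_threshold:
                self.long_term.append(old_token)


mem = TwoTierMemory(window_size=4, surprise_threshold=1.0)
for tok in ["the"] * 6 + ["zebra"] + ["the"] * 5:
    mem.observe(tok)
print(mem.long_term)
```

In this stream the repeated token is always predictable, so only the rare one is retained once it leaves the window. The design choice to illustrate is that the long-term tier is selective by novelty, not by age.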
The deepest reframe is that the hard problem is not storage capacity but *consolidation*: moving information from the fast, expensive working buffer into a compressed, durable form. One line of work finds that the long-context bottleneck is actually compute, not memory: performance improves with the number of passes spent transforming evicted context into internal state during offline "sleep" phases, following a test-time-scaling curve (see "Is long-context bottleneck really about memory or compute?"). The Sleep paradigm operationalizes this with knowledge distillation and RL-generated rehearsal ("dreaming") to fold in-context knowledge into weights without catastrophic forgetting (see "Can models consolidate memories during offline sleep phases?"). Agent systems do the same at the application layer: DeepAgent autonomously folds interaction history into episodic, working, and tool memory schemas (see "Can agents compress their own memory without losing critical details?"), while ReadAgent compresses documents into "gist memories" up front and fetches details only when a task demands them (see "Can LLMs read long documents like humans do?"). In all of these, the short/long boundary is really a *compression gradient*, not a wall.
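The gist-then-lookup pattern described for ReadAgent can be sketched as follows. This is a minimal illustration under stated assumptions: `make_gist` truncates text where a real system would summarize with a model, and `select_pages` uses keyword overlap where a real system would let the model choose which pages to reopen. All names here are hypothetical.

```python
def make_gist(page, max_words=8):
    """Stand-in compressor: keep the first few words of the page."""
    words = page.split()
    suffix = " ..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


class GistMemory:
    """Holds full pages plus a compressed gist of each one."""

    def __init__(self, pages, max_words=8):
        self.pages = list(pages)
        # Gists are built up front, before any task or query is known.
        self.gists = [make_gist(p, max_words) for p in self.pages]

    def select_pages(self, query_terms):
        # Assumed lookup policy: pick pages whose gist shares a query term.
        terms = {t.lower() for t in query_terms}
        return [i for i, gist in enumerate(self.gists)
                if terms & set(gist.lower().split())]

    def context(self, expand=()):
        # Working context: full text for expanded pages, gists for the rest.
        expand = set(expand)
        return [self.pages[i] if i in expand else gist
                for i, gist in enumerate(self.gists)]


pages = [
    "Alice founded the company in 1999 in a garage near Boston",
    "The product launched in 2003",
]
memory = GistMemory(pages)
hits = memory.select_pages(["company"])
print(memory.context(expand=hits))
```

The context returned for a query mixes resolutions: the relevant page at full detail and everything else as a gist, which is the compression gradient in miniature.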
Two findings should change how you think about the long-term tier specifically. First, what you store matters less than whether it is reachable: FluxMem shows that memory usefulness is a *connectivity* problem. Links between co-activated units form the subgraph you can actually traverse at decision time, and inert storage with bad topology is worthless (see "Is agent memory a storage problem or a connectivity problem?"). Second, how you *index* long-term memory determines whether it transfers: for web agents, procedures indexed by concrete environment state beat tidy high-level workflow abstractions, because the abstractions lose the click-by-click specifics that make a memory actionable (see "Does state-indexed memory outperform high-level workflow memory for web agents?").
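The connectivity point can be illustrated with a small graph: items are stored once, links are added when items are activated together, and what counts as usable is whatever a bounded traversal from the current cue can reach. This is a sketch of the general idea, not FluxMem's implementation; the class, the method names, and the hop limit are assumptions.

```python
from collections import defaultdict, deque


class LinkedMemory:
    """Stored items plus undirected links between co-activated items."""

    def __init__(self):
        self.items = {}
        self.links = defaultdict(set)

    def store(self, key, content):
        self.items[key] = content

    def co_activate(self, keys):
        # Link every pair of items that were active together.
        keys = list(keys)
        for a in keys:
            for b in keys:
                if a != b:
                    self.links[a].add(b)

    def reachable(self, cue, max_hops=2):
        # Breadth-first search from the cue, limited to max_hops links.
        if cue not in self.items:
            return set()
        seen = {cue}
        frontier = deque([(cue, 0)])
        while frontier:
            node, depth = frontier.popleft()
            if depth == max_hops:
                continue
            for nxt in self.links[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        return seen


mem = LinkedMemory()
for key in ["login_page", "password_reset", "email_inbox", "billing_faq"]:
    mem.store(key, f"notes about {key}")
mem.co_activate(["login_page", "password_reset"])
mem.co_activate(["password_reset", "email_inbox"])
print(sorted(mem.reachable("login_page")))
```

Here `billing_faq` is stored but has no links, so no cue other than itself will ever surface it. That is the "inert storage" case: present in the store, absent at decision time.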
So the honest answer is: don't start by splitting on time. First decide *who manages each memory*: the corpus separates an explicit "hot path", where the agent decides what to keep via tool calls, from an implicit background path that is triggered programmatically (see "How should agents decide what memories to keep?"). Then decide each memory's function, how it consolidates, and how it is indexed and linked for retrieval. Short-term vs. long-term falls out of those choices rather than driving them.
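The two management paths can be sketched side by side. In the example below, `save_memory_tool` stands for a tool the agent chooses to call (hot path), while `_background_consolidate` fires on a fixed schedule with no agent decision (background path). The class names, the every-N-turns trigger, and the join-based "consolidation" are assumptions for illustration only.

```python
class MemoryStore:
    """Minimal store that records where each memory came from."""

    def __init__(self):
        self.records = []

    def save(self, text, source):
        self.records.append({"text": text, "source": source})


class Agent:
    def __init__(self, store, background_every=3):
        self.store = store
        self.background_every = background_every
        self.turns = []

    def save_memory_tool(self, text):
        # Hot path: the agent explicitly decides this is worth keeping.
        self.store.save(text, source="hot_path")

    def on_turn(self, message):
        self.turns.append(message)
        # Background path: triggered by the program, not by the agent.
        if len(self.turns) % self.background_every == 0:
            self._background_consolidate()

    def _background_consolidate(self):
        # Stand-in for summarization: join the most recent turns.
        recent = self.turns[-self.background_every:]
        self.store.save(" | ".join(recent), source="background")


store = MemoryStore()
agent = Agent(store, background_every=2)
agent.on_turn("user: hi")
agent.save_memory_tool("user prefers dark mode")
agent.on_turn("user: change my theme")
print([r["source"] for r in store.records])
```

The hot path captures what the agent judged important in context, while the background path captures everything on schedule regardless of judgment. That is the context-sensitivity versus reliability trade-off in its simplest form.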
Sources (10 notes)
A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.
Titans architecture separates attention (short-term, quadratic) from neural memory (long-term, compressed), prioritizing surprising tokens for storage. The model outperforms standard Transformers and linear RNNs across tasks while scaling to 2M+ token contexts without quadratic penalties.
RAISE shows that agent memory consists of four components organized at two levels of granularity: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
The Sleep paradigm uses Knowledge Seeding (distilling smaller networks into larger ones) and Dreaming (RL-generated rehearsal) to consolidate in-context knowledge into weights without forgetting. Gains appear in long-context understanding, few-shot reasoning, and continual learning.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies; the combination of autonomy and structure avoids the degradation that plagues poorly designed consolidation.
ReadAgent compresses documents into gist memories before knowing the task, then retrieves details only when needed, extending effective context 3–20× and outperforming retrieval baselines on long-document QA.
FluxMem shows that memory usefulness is determined by links between co-activated units forming an accessible subgraph, not by what is stored. Storage is necessary but inert; topology determines whether useful memories are reachable at decision time.
PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.
Memory management decomposes into explicit hot-path (agent decides via tool calling) and implicit background (programmatically triggered) paths. Each approach trades context-sensitivity for reliability differently across generation, storage, retrieval, and deletion.