SYNTHESIS NOTE

Does agent memory degrade when continuously consolidated?

Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.

Synthesis note · 2026-05-18 · sourced from Memory

The promise of agent memory was straightforward: experience accumulates, gets distilled into reusable lessons, and agents become more capable over time. "Useful Memories Become Faulty When Continuously Updated by LLMs" (2605.12978) provides controlled evidence that this promise breaks. Under continuous consolidation, memory utility first rises, then degrades, and ultimately falls below the no-memory baseline. The agent ends up worse than if it had remembered nothing.

The cleanest demonstration uses ARC-AGI Stream: GPT-5.4 fails 54% of problems it had previously solved without memory, after those problems' solutions have been consolidated into the memory bank. The trajectories that produced the success are still there in raw form. The consolidation step itself is destroying the signal.

The paper localizes the failure to consolidation specifically through a clever control: keep the same trajectory pool and vary only the update schedule. Static-All (consolidate the entire pool in one pass) and Stream (consolidate batch-by-batch as trajectories arrive) produce qualitatively different end-state memories from identical inputs. The order and grouping of updates change what the memory becomes, even though the underlying experience is fixed. Meanwhile, an episodic-only control that simply appends raw trajectories to context performs competitively with the consolidators. The experience is fine. The consolidation is the bug.
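The schedule control can be illustrated with a toy model. The sketch below is not the paper's consolidator: `consolidate`, `static_all`, and `stream` are hypothetical names, and a frequency-based top-k rule stands in for an LLM summarizer. It only shows how a lossy update step makes the final memory depend on batching when the trajectory pool is held fixed.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def consolidate(memory: Sequence[str], batch: Iterable[str], capacity: int = 2) -> List[str]:
    """Toy lossy consolidator (assumption, not the paper's method).

    Keeps the `capacity` most frequent lessons. Each existing memory entry
    counts only once, so evidence counts from earlier batches are forgotten.
    That loss is what makes the result depend on the update schedule.
    """
    counts = Counter(memory)
    counts.update(batch)
    # Rank by count (descending), then alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [lesson for lesson, _ in ranked[:capacity]]


def static_all(pool: Sequence[str], capacity: int = 2) -> List[str]:
    """Consolidate the entire pool in one pass."""
    return consolidate([], pool, capacity)


def stream(pool: Sequence[str], batch_size: int, capacity: int = 2) -> List[str]:
    """Consolidate batch-by-batch, feeding the previous memory back in."""
    memory: List[str] = []
    for start in range(0, len(pool), batch_size):
        memory = consolidate(memory, pool[start:start + batch_size], capacity)
    return memory


# Identical inputs; only the schedule differs.
pool = ["a", "a", "a", "b", "c", "c", "d", "d"]
print(static_all(pool))            # ['a', 'c']
print(stream(pool, batch_size=2))  # ['d', 'a']
```

In this toy, the streaming schedule lets the most recent batch crowd out a lesson that has more total support in the pool. The real failure modes are richer, but the structural point is the same: identical experience, different memory.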

Three mechanisms drive the failure. First, misgrouping: agents pool episodes that do not share underlying structure before abstracting, producing principles that apply to nothing in particular. Second, applicability stripping: even when grouping is correct, the abstraction step drops the conditions under which a lesson holds, so overgeneralized entries interfere with neighboring tasks where they should not apply. Third, overfitting on narrow streams: when the input stream is repetitive, abstraction overfits to seen instances and generalizes poorly even within the same task.
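The second mechanism, applicability stripping, is easy to picture as a data-structure problem. The following is a minimal sketch under assumed names (`Lesson`, `applies_when`, and the task dictionaries are all hypothetical, not taken from the paper): a lesson that keeps its condition stays silent on a neighboring task, while the same lesson with the condition dropped fires everywhere.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

Task = Dict[str, str]


@dataclass
class Lesson:
    advice: str
    # Condition under which the advice holds. None models a lesson whose
    # applicability condition was dropped during abstraction.
    applies_when: Optional[Callable[[Task], bool]] = None

    def matches(self, task: Task) -> bool:
        # A stripped lesson matches every task, relevant or not.
        return True if self.applies_when is None else self.applies_when(task)


mirror_task: Task = {"family": "mirror"}
rotation_task: Task = {"family": "rotation"}

conditioned = Lesson(
    advice="flip the grid horizontally",
    applies_when=lambda task: task["family"] == "mirror",
)
stripped = Lesson(advice="flip the grid horizontally")

print(conditioned.matches(mirror_task))    # True
print(conditioned.matches(rotation_task))  # False
print(stripped.matches(rotation_task))     # True: interferes with a neighboring task
```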

The practical takeaway flips the default. Raw episodes should be treated as first-class evidence, not disposable material to be compressed away. Consolidation should be gated explicitly — selective, delayed, and grounded in trajectories that remain recoverable. The current default, where consolidation fires after every interaction, treats abstraction as cheap; the evidence shows it is costly and easily wrong. Continuously updated textual memory should be treated not as a reliable engine of self-improvement but as a fragile mechanism that can make more experience produce worse memory.
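One way to read "gated, delayed, and grounded" is as a store in which raw episodes are never discarded and a lesson is written only once enough supporting episodes exist. The sketch below is an illustration under stated assumptions, not the paper's design: `GatedMemory`, `min_support`, and `ConsolidatedLesson` are hypothetical names, and the lesson text is a placeholder where a real system would call a summarizer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Episode:
    episode_id: int
    task_family: str
    trajectory: str
    solved: bool


@dataclass
class ConsolidatedLesson:
    task_family: str             # applicability condition kept with the lesson
    text: str
    source_ids: Tuple[int, ...]  # provenance: which raw episodes back this lesson


@dataclass
class GatedMemory:
    min_support: int = 3  # gate: required number of solved episodes (assumed threshold)
    episodes: List[Episode] = field(default_factory=list)
    lessons: Dict[str, ConsolidatedLesson] = field(default_factory=dict)

    def record(self, episode: Episode) -> None:
        """Append the raw episode; nothing is compressed away."""
        self.episodes.append(episode)

    def maybe_consolidate(self, task_family: str) -> Optional[ConsolidatedLesson]:
        """Write a lesson only when the gate is satisfied; otherwise do nothing."""
        support = [e for e in self.episodes if e.task_family == task_family and e.solved]
        if len(support) < self.min_support:
            return None
        lesson = ConsolidatedLesson(
            task_family=task_family,
            # Placeholder text; a real system would summarize `support` here.
            text=f"summary of {len(support)} solved episodes",
            source_ids=tuple(e.episode_id for e in support),
        )
        self.lessons[task_family] = lesson
        return lesson

    def evidence_for(self, lesson: ConsolidatedLesson) -> List[Episode]:
        """Recover the raw trajectories a lesson was derived from."""
        wanted = set(lesson.source_ids)
        return [e for e in self.episodes if e.episode_id in wanted]


memory = GatedMemory(min_support=3)
memory.record(Episode(0, "mirror", "trajectory 0", solved=True))
memory.record(Episode(1, "mirror", "trajectory 1", solved=True))
early = memory.maybe_consolidate("mirror")  # None: gate not yet satisfied
memory.record(Episode(2, "mirror", "trajectory 2", solved=True))
late = memory.maybe_consolidate("mirror")   # lesson backed by episodes 0, 1, 2
```

Because every lesson carries `source_ids`, a lesson that turns out to be wrong can be checked against, or rebuilt from, the trajectories that produced it.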

The deeper implication is uncomfortable for the field. Many agent-memory systems rely on the assumption that summarized experience is at worst lossy and at best generalizing. This paper shows it is often actively harmful. Building reliable agentic memory requires LLMs that can consolidate without overwriting the evidence they depend on, and on this evidence current LLMs cannot.


Original note title

continuously consolidated agent memory follows an inverted-U utility curve — degrading below the no-memory baseline because consolidation is fragile