What distinguishes formation, evolution, and retrieval as separate memory dynamics?
This explores why a recent reframing of AI agent memory treats forming a memory, changing it over time, and recalling it as three separate processes — rather than lumping them under the familiar short-term vs. long-term split.
One influential survey of agent memory insists that formation, evolution, and retrieval are *different kinds of activity* rather than stages of one pipeline. The argument starts in Can three axes replace the short-term long-term memory split?, which proposes that what we used to call "short-term" and "long-term" memory aren't built-in architectural compartments at all; they're patterns that emerge from these three dynamics playing out over time. Formation is the act of writing something down (what gets encoded, in what shape). Evolution is what happens to it afterward (compression, revision, decay, contradiction-resolution). Retrieval is the act of pulling it back when it's relevant. Treating them as one process hides the fact that a system can be excellent at one and terrible at another.
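To make the separation concrete, here is a minimal sketch of a memory store that exposes the three dynamics as independent operations. Everything in it is an illustrative assumption rather than anything the survey specifies: the `ToyMemory` and `MemoryItem` names, the `strength` score, the decay-and-prune rule, and the word-overlap ranking are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0  # hypothetical salience score; not from the survey


@dataclass
class ToyMemory:
    items: list[MemoryItem] = field(default_factory=list)

    def form(self, observation: str) -> None:
        """Formation: decide what gets encoded and in what shape."""
        cleaned = observation.strip()
        if cleaned:  # toy encoding policy: skip empty observations
            self.items.append(MemoryItem(cleaned))

    def evolve(self, decay: float = 0.5, floor: float = 0.2) -> None:
        """Evolution: change stored items afterward (here: decay, then prune)."""
        for item in self.items:
            item.strength *= decay
        self.items = [i for i in self.items if i.strength >= floor]

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        """Retrieval: rank stored items by word overlap with the query."""
        q = set(query.lower().split())
        scored = [(len(q & set(i.text.lower().split())), i.text) for i in self.items]
        scored = [s for s in scored if s[0] > 0]  # drop items with no overlap
        scored.sort(key=lambda s: s[0], reverse=True)
        return [text for _, text in scored[:k]]


memory = ToyMemory()
memory.form("the deploy failed on step 3")
memory.form("user prefers short answers")
hit = memory.retrieve("why did the deploy fail")
```

In this toy, `hit` contains the deploy note. Calling `memory.evolve()` three times with the default settings prunes every item, after which the same query returns nothing even though the formation and ranking code is unchanged. That is the kind of localized failure the three-axis view is meant to expose.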
The corpus makes the distinction vivid by showing systems that specialize in each. On the *formation* side, Can agents learn from failure without updating their weights? (Reflexion) shows that what you choose to encode matters enormously — binary success/failure feedback produces honest self-diagnoses, and deliberately *not* compressing them at write-time keeps them usable later. On the *evolution* side, Can agents compress their own memory without losing critical details? (DeepAgent) and Can context playbooks prevent knowledge loss during iteration? (ACE) are entirely about what happens to memory *after* it's formed: folding history into episodic/working/tool schemas, or updating a "playbook" through small incremental edits rather than full rewrites. Notably, ACE's whole point is that careless evolution — over-compression — causes "context collapse," a failure that has nothing to do with how the memory was first formed or whether it can be retrieved.
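The contrast between small incremental edits and full rewrites can be illustrated with a toy playbook stored as a dictionary. This is a sketch under stated assumptions, not ACE's actual mechanism: the `apply_delta` and `naive_rewrite` functions, the `(op, key, text)` delta format, and the sample entries are all hypothetical, and truncation is only a crude stand-in for over-compression.

```python
def apply_delta(
    playbook: dict[str, str], delta: list[tuple[str, str, str]]
) -> dict[str, str]:
    """Return a new playbook with small, targeted edits applied.

    Each delta entry is (op, key, text), with op in {"add", "update", "remove"}.
    Entries that the delta does not mention are carried over untouched.
    """
    updated = dict(playbook)
    for op, key, text in delta:
        if op in ("add", "update"):
            updated[key] = text
        elif op == "remove":
            updated.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return updated


def naive_rewrite(playbook: dict[str, str], max_entries: int) -> dict[str, str]:
    """Crude stand-in for over-compression: keep only the first few entries."""
    return dict(list(playbook.items())[:max_entries])


playbook = {
    "api-auth": "refresh token before each batch",
    "retry": "back off 2s on HTTP 429",
    "units": "amounts are in cents",
}
edited = apply_delta(
    playbook,
    [("update", "retry", "back off 4s on HTTP 429"), ("add", "dates", "use ISO 8601")],
)
collapsed = naive_rewrite(playbook, max_entries=1)
```

Here `edited` still holds the `units` entry alongside the revised `retry` rule and the new `dates` rule, while `collapsed` has silently lost both `retry` and `units`. The loss comes entirely from the evolution step; the entries were formed correctly and would have been retrievable.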
Retrieval, the third axis, turns out to be where memory stops being a lookup and becomes reasoning. Can reasoning systems maintain memory across retrieval cycles? (ComoRAG) shows that *stateful* retrieval (keeping a persistent workspace across multiple retrieval cycles and resolving contradictions as you go) beats stateless multi-step retrieval by up to 11% on complex queries. That's a clue that retrieval isn't a passive read; it has its own dynamics, its own workspace, its own failure modes. How should agent memory split across time scales? (RAISE) reinforces the larger point in miniature: even "working memory" splits into components with *different update policies and different failure modes*, which is exactly what you'd expect if formation, evolution, and retrieval are genuinely separate.
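The difference between stateless and stateful retrieval can be sketched with a toy loop in which a persistent workspace decides the next query. This is not ComoRAG's algorithm: the `Workspace` class, the dictionary-backed `index`, the caller-supplied `next_query` policy, and the sample passages are all hypothetical, and contradiction handling is left out to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Workspace:
    """State that persists across retrieval cycles."""
    evidence: list[str] = field(default_factory=list)
    seen_queries: set[str] = field(default_factory=set)


def stateless_retrieve(index: dict[str, list[str]], queries: list[str]) -> list[list[str]]:
    """Each query is answered on its own; nothing carries over between them."""
    return [index.get(q, []) for q in queries]


def stateful_retrieve(
    index: dict[str, list[str]],
    first_query: str,
    next_query: Callable[[Workspace], Optional[str]],
    max_cycles: int = 3,
) -> Workspace:
    """Accumulate evidence in a workspace and let it drive the follow-up query."""
    ws = Workspace()
    query: Optional[str] = first_query
    for _ in range(max_cycles):
        if query is None or query in ws.seen_queries:
            break  # nothing new to ask
        ws.seen_queries.add(query)
        for passage in index.get(query, []):
            if passage not in ws.evidence:
                ws.evidence.append(passage)
        query = next_query(ws)  # the policy reads accumulated state
    return ws


index = {
    "who hired Ana": ["Ana was hired by Ben"],
    "who is Ben": ["Ben runs the Lisbon office"],
}


def follow_up(ws: Workspace) -> Optional[str]:
    """Toy policy: if the evidence mentions Ben, ask who Ben is (once)."""
    mentions_ben = any("Ben" in passage for passage in ws.evidence)
    if mentions_ben and "who is Ben" not in ws.seen_queries:
        return "who is Ben"
    return None


result = stateful_retrieve(index, "who hired Ana", follow_up)
```

With this toy index, `result.evidence` ends up holding both passages, because the second query was generated from what the first one returned. A stateless caller would have had to know in advance to ask about Ben.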
What makes the separation more than bookkeeping is that the corpus has cases that deliberately collapse one of the three — and the consequences localize. Can reasoning systems forget history without losing coherence? (Atom of Thoughts) throws away evolution and retrieval entirely, making each reasoning state depend only on the current problem, and shows you can keep coherence without historical baggage. Meanwhile Where do memorization errors arise in chain-of-thought reasoning? shows the dark side of unintended formation: "local memorization" from preceding tokens silently writes itself into reasoning and causes up to two-thirds of errors. And Do RL agents accidentally use environments as memory? shows formation can happen *outside the agent entirely* — RL agents offload history into the environment as an external store without any explicit memory objective.
The payoff of the three-axis view is diagnostic precision. A more provocative companion idea, Can cognition work by reusing memory instead of recomputing?, even suggests that cognition itself is mostly *retrieval-as-reuse* — navigating stored inference paths backward rather than recomputing — which would make retrieval the load-bearing axis, not a footnote. So the answer to what distinguishes the three: they fail independently, they're optimized by different mechanisms, and the old short-term/long-term vocabulary obscured all of it. Once you separate them, you can finally ask *which* part of a memory system is broken instead of just calling it "forgetful."
Sources (10 notes)
A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.
RAISE shows that agent memory consists of four components organized at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.
Memory-Amortized Inference proposes that intelligence arises from structured reuse of prior inference paths over topological memory, inverting RL's reward-forward logic into cause-backward reconstruction. This duality explains energy efficiency and suggests memory trajectories form the substrate of adaptive thought.