INQUIRING LINE

What distinguishes formation, evolution, and retrieval as separate memory dynamics?

This explores why a recent reframing of AI agent memory treats forming a memory, changing it over time, and recalling it as three separate processes — rather than lumping them under the familiar short-term vs. long-term split.


One influential survey of agent memory insists that formation, evolution, and retrieval are *different kinds of activity* rather than stages of one pipeline. The argument starts in Can three axes replace the short-term long-term memory split?, which proposes that what we used to call "short-term" and "long-term" memory aren't built-in architectural compartments at all; they're patterns that emerge from these three dynamics playing out over time. Formation is the act of writing something down (what gets encoded, in what shape). Evolution is what happens to it afterward (compression, revision, decay, contradiction-resolution). Retrieval is the act of pulling it back when it's relevant. Treating them as one process hides the fact that a system can be excellent at one and terrible at another.
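To make the separation concrete, here is a minimal sketch of a store that exposes the three dynamics as separate operations. Everything in it is an assumption made for illustration: `ToyMemory`, `MemoryItem`, the age-based decay rule, and the keyword-overlap lookup are hypothetical stand-ins, not anything the survey specifies.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    written_at: int      # step at which the item was formed
    revisions: int = 0   # how many evolution passes the item has survived


@dataclass
class ToyMemory:
    """Hypothetical store keeping the three dynamics as separate methods."""
    items: list[MemoryItem] = field(default_factory=list)

    def form(self, observation: str, step: int) -> None:
        # Formation: decide what gets encoded and in what shape.
        self.items.append(MemoryItem(text=observation.strip(), written_at=step))

    def evolve(self, now: int, max_age: int) -> None:
        # Evolution: change what is already stored (here, simple age-based decay).
        self.items = [m for m in self.items if now - m.written_at <= max_age]
        for m in self.items:
            m.revisions += 1

    def retrieve(self, query: str) -> list[str]:
        # Retrieval: pull back what is relevant (here, naive keyword overlap).
        words = set(query.lower().split())
        return [m.text for m in self.items if words & set(m.text.lower().split())]


mem = ToyMemory()
mem.form("tool X fails on empty input", step=0)
mem.form("user prefers short answers", step=5)
before = mem.retrieve("why does tool X fail")   # finds the tool note
mem.evolve(now=10, max_age=6)                   # decay drops the older note
after = mem.retrieve("why does tool X fail")    # same query, now empty
```

In this toy run formation and retrieval are unchanged between the two queries; only the evolution step differs, which is the sense in which a system can be fine at two of the three and still fail on the third.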

The corpus makes the distinction vivid by showing systems that specialize in each. On the *formation* side, Can agents learn from failure without updating their weights? (Reflexion) shows that what you choose to encode matters enormously — binary success/failure feedback produces honest self-diagnoses, and deliberately *not* compressing them at write-time keeps them usable later. On the *evolution* side, Can agents compress their own memory without losing critical details? (DeepAgent) and Can context playbooks prevent knowledge loss during iteration? (ACE) are entirely about what happens to memory *after* it's formed: folding history into episodic/working/tool schemas, or updating a "playbook" through small incremental edits rather than full rewrites. Notably, ACE's whole point is that careless evolution — over-compression — causes "context collapse," a failure that has nothing to do with how the memory was first formed or whether it can be retrieved.
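The contrast between incremental edits and full rewrites can be sketched as below. This is not ACE's implementation: `apply_delta`, the `(op, key, text)` edit format, and the truncating `full_rewrite` are hypothetical, and the truncation is only a crude stand-in for a lossy, compressing rewrite.

```python
def apply_delta(playbook: dict[str, str],
                delta: list[tuple[str, str, str]]) -> dict[str, str]:
    """Apply small (op, key, text) edits without touching other entries."""
    updated = dict(playbook)
    for op, key, text in delta:
        if op in ("add", "update"):
            updated[key] = text
        elif op == "remove":
            updated.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
    return updated


def full_rewrite(playbook: dict[str, str], budget: int) -> dict[str, str]:
    """Stand-in for a lossy rewrite: keep only the first `budget` entries."""
    return dict(list(playbook.items())[:budget])


playbook = {
    "p1": "validate inputs before calling tools",
    "p2": "retry on timeout",
    "p3": "log tool errors with the request id",
}
edited = apply_delta(playbook, [
    ("update", "p2", "retry twice on timeout"),
    ("add", "p4", "cache successful lookups"),
])
rewritten = full_rewrite(playbook, budget=2)
```

Under these assumptions `edited` still contains every untouched entry, while `rewritten` has silently lost `p3`. The loss happens entirely in the evolution step, regardless of how the entries were first written or how they are later looked up.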

Retrieval, the third dynamic, turns out to be where memory stops being a lookup and becomes reasoning. Can reasoning systems maintain memory across retrieval cycles? (ComoRAG) shows that *stateful* retrieval, which keeps a persistent workspace across multiple retrieval cycles and resolves contradictions as it goes, beats stateless multi-step retrieval by up to 11% on complex queries. That's a clue that retrieval isn't a passive read; it has its own dynamics, its own workspace, its own failure modes. How should agent memory split across time scales? (RAISE) reinforces the larger point in miniature: even "working memory" splits into components with *different update policies and different failure modes*, which is exactly what you'd expect if formation, evolution, and retrieval are genuinely separate.
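A rough sketch of the stateless versus stateful difference follows. It is not ComoRAG's algorithm: the keyword `search`, the tiny corpus, and the idea of expanding each query with the workspace contents are all assumptions chosen to show state carrying over between cycles, and contradiction resolution is left out entirely.

```python
def search(corpus: list[str], query: str) -> list[str]:
    # Naive keyword-overlap lookup, used by both variants.
    words = set(query.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]


def stateless_cycles(corpus: list[str], queries: list[str]) -> list[list[str]]:
    # Each cycle sees only its own query; nothing carries over.
    return [search(corpus, q) for q in queries]


def stateful_cycles(corpus: list[str], queries: list[str]) -> list[str]:
    workspace: list[str] = []  # persists across retrieval cycles
    for q in queries:
        # Expand the query with the evidence gathered so far.
        context = " ".join(workspace)
        for doc in search(corpus, f"{q} {context}"):
            if doc not in workspace:
                workspace.append(doc)
    return workspace


corpus = [
    "alice manages the payments team",
    "the payments team owns service billing-api",
    "billing-api crashed on friday",
]
queries = ["alice", "crashed"]
per_cycle = stateless_cycles(corpus, queries)
workspace = stateful_cycles(corpus, queries)
```

In this example the stateless run never surfaces the middle document that links Alice's team to the crashed service, because neither query mentions it. The stateful run finds it only because the first cycle's evidence shapes the second cycle's query.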

What makes the separation more than bookkeeping is that the corpus has cases that deliberately collapse one of the three — and the consequences localize. Can reasoning systems forget history without losing coherence? (Atom of Thoughts) throws away evolution and retrieval entirely, making each reasoning state depend only on the current problem, and shows you can keep coherence without historical baggage. Meanwhile Where do memorization errors arise in chain-of-thought reasoning? shows the dark side of unintended formation: "local memorization" from preceding tokens silently writes itself into reasoning and causes up to two-thirds of errors. And Do RL agents accidentally use environments as memory? shows formation can happen *outside the agent entirely* — RL agents offload history into the environment as an external store without any explicit memory objective.
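The memoryless idea can be illustrated with a toy analogue. It is not the Atom of Thoughts algorithm: the nested-tuple expression format, `contract`, and `solve` are hypothetical. Each step rewrites the current problem into a simpler one with the same answer, and the step function never sees earlier states.

```python
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


def contract(expr):
    """One contraction step: resolve sub-problems whose inputs are already known.

    `expr` is either an int or a tuple (op, left, right).
    """
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, contract(left), contract(right))


def solve(expr):
    """Repeatedly contract; only the current expression is ever kept."""
    steps = 0
    while not isinstance(expr, int):
        expr = contract(expr)  # the previous state is discarded here
        steps += 1
    return expr, steps


problem = ("add", ("mul", 2, 3), ("add", 1, 4))
answer, steps = solve(problem)  # ("add", 6, 5) after one step, then 11
```

Because `contract` takes only the current expression, there is nothing for evolution or retrieval to operate on; the state itself is the whole memory.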

A more provocative companion idea, Can cognition work by reusing memory instead of recomputing?, even suggests that cognition itself is mostly *retrieval-as-reuse* (navigating stored inference paths backward rather than recomputing), which would make retrieval the load-bearing dynamic, not a footnote. So the answer to what distinguishes the three: they fail independently, they're optimized by different mechanisms, and the old short-term/long-term vocabulary obscured all of it. The payoff of separating them is diagnostic precision: you can finally ask *which* part of a memory system is broken instead of just calling it "forgetful."
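As a sketch of what that diagnostic question might look like in practice, the function below attributes a missing fact to one of the three dynamics. The function name and the three sets it takes (what was ever written, what is currently stored, what a query returned) are assumptions for illustration; a real system would need its own instrumentation to produce them.

```python
def diagnose(fact: str, written: set[str], stored: set[str],
             retrieved: set[str]) -> str:
    """Attribute a missing fact to the first dynamic that lost it."""
    if fact not in written:
        return "formation"   # never encoded in the first place
    if fact not in stored:
        return "evolution"   # encoded, then compressed or decayed away
    if fact not in retrieved:
        return "retrieval"   # still stored, but not surfaced by the query
    return "ok"


written = {"api key rotates weekly", "user is in UTC+2", "deploys happen on friday"}
stored = {"api key rotates weekly", "user is in UTC+2"}
retrieved = {"user is in UTC+2"}

report = {
    fact: diagnose(fact, written, stored, retrieved)
    for fact in ["user likes tea", "deploys happen on friday",
                 "api key rotates weekly", "user is in UTC+2"]
}
```

All four facts would look the same from the outside if the only available label were "forgetful"; here each one maps to a different repair.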


Sources (10 notes)

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Can reasoning systems maintain memory across retrieval cycles?

ComoRAG demonstrates that iterative evidence acquisition with a persistent memory workspace outperforms stateless multi-step retrieval by detecting and resolving contradictions through deeper exploration, achieving up to 11% gains on complex queries.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.

Where do memorization errors arise in chain-of-thought reasoning?

The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, which is driven by the preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Do RL agents accidentally use environments as memory?

A mathematical proof shows that environmental artifacts reduce the information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.

Can cognition work by reusing memory instead of recomputing?

Memory-Amortized Inference proposes intelligence arises from structured reuse of prior inference paths over topological memory, inverting RL's reward-forward logic into cause-backward reconstruction. This duality explains energy efficiency and suggests memory trajectories form the substrate of adaptive thought.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a memory systems researcher re-evaluating the formation–evolution–retrieval tripartite model. Is it still the dominant frame for agent memory, or has capability progress in models, retrieval methods, or agent orchestration collapsed or reordered these axes?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. The library claims:
• Formation, evolution, and retrieval are *independent failure modes*, not pipeline stages: e.g., Reflexion (2023) shows that *not* compressing reflections at write-time keeps them usable; ACE (2025) shows over-compression causes "context collapse" independent of retrieval quality (arXiv:2510.04618).
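• Evolution (post-hoc memory transformation) is load-bearing: DeepAgent's memory folding (2025) consolidates history into episodic, working, and tool schemas without the degradation of poorly designed consolidation, and ComoRAG's stateful retrieval (2025) outperforms stateless multi-step retrieval by up to 11% on complex queries; local (preceding-token) memorization accounts for up to 67% of CoT reasoning errors (arXiv:2508.02037).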
• Retrieval-as-reuse (memory-amortized inference, arXiv:2508.14143) may be cognition's primary axis, not a lookup; Atom of Thoughts (2025) throws away evolution/retrieval entirely, keeping coherence via Markov reasoning.

Anchor papers (verify; mind their dates):
• arXiv:2512.13564 (Memory in the Age of AI Agents: A Survey, 2025-12) — the foundational survey proposing the three-axis model.
• arXiv:2510.04618 (Agentic Context Engineering, 2025-10) — operationalizes evolution as playbook updates.
• arXiv:2508.10419 (ComoRAG, 2025-08) — reframes retrieval as stateful reasoning, not lookup.
• arXiv:2508.14143 (Memory-Amortized Inference, 2025-08) — proposes retrieval-as-reuse as foundational.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For formation, evolution, and retrieval: has newer model scaling (o1, o3, or equivalents), retrieval-augmented generation (RAG) refinements, or multi-agent orchestration since mid-2025 *collapsed* the distinction between any two axes, or *relocated* the bottleneck? Distinguish durable questions (e.g., *does memory need three subsystems?*) from perishable claims (e.g., *stateless retrieval underperforms*). Cite what moved each constraint.
(2) **Surface strongest CONTRADICTING work from last 6 months.** Does recent work argue the three-axis model is over-factored, or that a different decomposition (e.g., active vs. passive, episodic vs. semantic, on-device vs. external) better predicts failure?
(3) **Propose 2 research questions assuming the regime has moved.** E.g., if formation–evolution–retrieval no longer cleanly separate under new training regimes, what *does* separate modern agent memory failure modes?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
