INQUIRING LINE

How do token, parametric, and latent memory forms coexist in single agents?

This explores a 2025 survey's claim that agent memory comes in three physical forms — text the model reads (token), knowledge baked into weights (parametric), and information living in hidden activations (latent) — and asks how a single agent uses all three at once.


This explores how one agent juggles three different *places* memory can live: in the text it loads into a context window (token), in the model's own weights (parametric), and in the hidden vectors flowing through the network (latent). The cleanest map for this is a recent survey that deliberately retires the old short-term/long-term split and replaces it with three axes: forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (how memory forms, evolves, and gets retrieved) [Can three axes replace the short-term long-term memory split?]. The payoff of that reframing is that "short-term" and "long-term" stop being architectures and become *temporal patterns*: the same agent can hold a fact in token form for one turn and in parametric form forever, and the difference is duration, not a separate module.
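One way to make the reframing concrete is to tag each memory item with a form and a function and derive "short-term" or "long-term" from its duration alone. The sketch below is illustrative only: the class names, the `ttl_turns` field, and the one-turn threshold are assumptions made for this example, not definitions from the survey.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Form(Enum):
    TOKEN = "token"            # text placed in the context window
    PARAMETRIC = "parametric"  # knowledge stored in model weights
    LATENT = "latent"          # hidden activations or KV cache


class Function(Enum):
    FACTUAL = "factual"
    EXPERIENTIAL = "experiential"
    WORKING = "working"


@dataclass(frozen=True)
class MemoryItem:
    content: str
    form: Form
    function: Function
    # Hypothetical field: how many turns the item is kept; None = indefinitely.
    ttl_turns: Optional[int]


def temporal_pattern(item: MemoryItem, short_term_max_turns: int = 1) -> str:
    """Label an item from its duration only; the form plays no role here."""
    if item.ttl_turns is not None and item.ttl_turns <= short_term_max_turns:
        return "short-term"
    return "long-term"


fact = "The deploy target is staging."
in_context = MemoryItem(fact, Form.TOKEN, Function.FACTUAL, ttl_turns=1)
in_weights = MemoryItem(fact, Form.PARAMETRIC, Function.FACTUAL, ttl_turns=None)

print(temporal_pattern(in_context))  # short-term
print(temporal_pattern(in_weights))  # long-term
```

The same fact appears twice with different forms, and the label changes only because the duration does, which is the point of treating short/long-term as a pattern rather than a module.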

In practice, today's agents lean overwhelmingly on the token form and route around the parametric one. A striking cluster of the corpus shows agents learning *without ever touching their weights*: Reflexion writes verbal self-diagnoses into episodic memory after each failure [Can agents learn from failure without updating their weights?], and AgentFly formalizes the whole learning loop as memory operations over case, subtask, and tool stores, reaching 87.88% on GAIA validation with the base model frozen [Can agents learn continuously from experience without updating weights?]. The lesson is that what *could* be slow parametric updating gets deliberately pushed into fast, inspectable token memory. RAISE makes the token side even more granular, splitting agent memory into dialogue-level and turn-level components that fail and update differently [How should agent memory split across time scales?], while DeepAgent shows the token form doesn't have to grow forever: agents can autonomously *fold* their history into episodic, working, and tool schemas to fight context bloat [Can agents compress their own memory without losing critical details?].
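The "learn without touching weights" pattern can be sketched as a retry loop whose only state is a list of verbal reflections prepended to the next prompt. This is a minimal sketch in the spirit of Reflexion, not its actual implementation: `run_with_reflections` and the three toy callables are hypothetical names, and the "model" is a stub standing in for a frozen LLM.

```python
from typing import Callable, List, Tuple


def run_with_reflections(
    task: str,
    attempt: Callable[[str], str],       # frozen model: prompt -> answer
    evaluate: Callable[[str], bool],     # binary environment feedback
    reflect: Callable[[str, str], str],  # (task, failed answer) -> self-diagnosis
    max_trials: int = 3,
) -> Tuple[bool, List[str]]:
    """Retry a task, storing reflections as token memory between trials."""
    reflections: List[str] = []  # episodic memory, kept uncompressed
    for _ in range(max_trials):
        prompt = "\n".join(reflections + [task])
        answer = attempt(prompt)
        if evaluate(answer):
            return True, reflections
        reflections.append(reflect(task, answer))
    return False, reflections


# Toy stand-ins (assumptions for illustration; a real agent would call an LLM).
def toy_attempt(prompt: str) -> str:
    return "4" if "add, do not concatenate" in prompt else "22"


def toy_evaluate(answer: str) -> bool:
    return answer == "4"


def toy_reflect(task: str, answer: str) -> str:
    return f"Reflection: I answered {answer!r}; add, do not concatenate."


ok, memory = run_with_reflections(
    "What is 2 + 2?", toy_attempt, toy_evaluate, toy_reflect
)
print(ok, len(memory))  # True 1
```

Nothing about `toy_attempt` changes between trials; the improvement comes entirely from what was written into the prompt, which is why this kind of memory stays inspectable.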

The latent form is the surprising frontier, and it's where coexistence stops being theoretical. Instead of serializing thoughts back into tokens, agents can pass raw hidden representations to each other: LatentMAS shares state directly through KV caches, cutting token usage by 70–84% while *gaining* accuracy because the embeddings preserve reasoning that text would flatten [latent-multi-agent-collaboration-achieves-training-free-lossless-information-exch]. A companion line uses sparse autoencoders to pull individual, shared, and private "thoughts" out of those hidden states, even catching alignment conflicts before they ever surface as language [Can agents share thoughts directly without using language?]. So a single agent can be reading token memory, carrying latent memory in its activations, and only occasionally committing anything to parametric weights. The three forms genuinely live side by side.

What ties this together is a structural argument the corpus makes repeatedly: reliability comes from *externalizing* cognition rather than scaling the model. One analysis names memory, skills, and protocols as three burdens that reliable agents offload into a harness layer instead of forcing the weights to re-solve them every run [agent-reliability-comes-from-externalizing-cognitive-burdens-into-system-structures]. Read against the three-forms taxonomy, that's exactly why token and latent memory are doing so much work while parametric memory stays mostly frozen: moving knowledge *out* of the weights is the whole design move.

One less expected point: the boundaries between these forms are porous and increasingly *chosen* per task. The non-linear-prompting work shows a single model can simulate a whole debate in token space and match what multi-agent systems do [Can branching prompts replicate what multi-agent systems do?], and the SLM-sufficiency argument suggests the economical agent is heterogeneous by design, with cheap small models for routine work and large ones used selectively [Can small language models handle most agent tasks?]. Coexistence, then, isn't three modules bolted together; it's a continuous engineering decision about *where to put a given piece of knowledge* so it's cheap to write, faithful to retrieve, and durable for as long as it's needed.
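To show what a per-item placement decision might look like, here is a deliberately simple heuristic. It is purely illustrative: the `Knowledge` fields, the `choose_form` rules, and the 50-turn budget are all assumptions invented for this sketch and do not come from the survey or any cited system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Knowledge:
    description: str
    turns_needed: int          # how long the item must stay available
    must_be_inspectable: bool  # a human or tool needs to read it
    consumer_is_model: bool    # only another forward pass consumes it


def choose_form(k: Knowledge, context_budget_turns: int = 50) -> str:
    """Pick a memory form for one item (hypothetical rules, see lead-in)."""
    # Transient state that only the model consumes can stay in activations.
    if k.consumer_is_model and not k.must_be_inspectable and k.turns_needed <= 1:
        return "latent"
    # Anything that must be readable, or fits the context budget, stays as text.
    if k.must_be_inspectable or k.turns_needed <= context_budget_turns:
        return "token"
    # Long-lived knowledge that nobody needs to read is a candidate for weights.
    return "parametric"


items = [
    Knowledge("intermediate reasoning state", 1, False, True),
    Knowledge("reflection on a failed attempt", 10, True, False),
    Knowledge("core domain skill", 10_000, False, False),
]
for item in items:
    print(item.description, "->", choose_form(item))
```

A real system would weigh write cost, retrieval fidelity, and durability with far more care; the sketch only shows that the choice can be made item by item rather than fixed by architecture.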


Sources 10 notes

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents share thoughts directly without using language?

Research formalizes inter-agent thought sharing via sparse autoencoders that recover individual, shared, and private latent thoughts from hidden states. This approach detects alignment conflicts at the representational level before they manifest in language.

Can branching prompts replicate what multi-agent systems do?

Research shows single LLMs using dynamic persona simulation achieve multi-agent cognitive synergy without multiple model instances. Solo Performance Prompting validates that structured prompting techniques map directly onto multi-agent debate architectures, so one model can reach equivalent outcomes.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question: **How do token, parametric, and latent memory forms coexist and trade off in single agents—and has that coexistence regime shifted?**

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–2026; treat them as perishable snapshots:
- Agents overwhelmingly use token memory and route around parametric updates; models stay frozen while episodic/working memory absorbs learning (Reflexion, AgentFly ~87.88% GAIA, 2024–2025).
- Token memory decomposes into dialogue and turn-level working components that fail/update at different rates (RAISE, ~2025).
- Latent memory via direct KV-cache sharing cuts token count 70–84% while gaining accuracy (LatentMAS, ~2025–2026); a separate line of work uses sparse autoencoders to extract individual, shared, and private "thoughts" from hidden states.
- Reliability comes from externalizing memory, skills, and protocols into harness layers rather than scaling parametric weights (2026).
- Small models are sufficient for routine agentic tasks at 10–30× lower cost; heterogeneous architectures (cheap SLMs by default, large models used selectively) are the economically rational design (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2512.13564 (Memory in the Age of AI Agents: Forms, Functions and Dynamics, 2025-12)
- arXiv:2511.20639 (Latent Collaboration in Multi-Agent Systems, 2025-11)
- arXiv:2510.21618 (DeepAgent: A General Reasoning Agent, 2025-10)
- arXiv:2604.08224 (Externalization in LLM Agents, 2026-04)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding, ask whether latent-to-token bandwidth improvements, sparse autoencoders, or new KV-cache orchestration (e.g., paged attention, hierarchical caching) have *relaxed* the token-bloat constraint or the parametric-update freeze. Separately assess: does the "frozen weights + token memory" pattern hold for reasoning agents, long-horizon tasks, and fine-tuning-averse deployments? What *hasn't* changed: is latent memory still architecturally coupled to single forward passes, or can it now persist across agent rollouts?

(2) **SURFACE STRONGEST CONTRADICTIONS.** Hunt for work in the last 6 months claiming parametric memory *is* still the bottleneck, or arguing latent forms collapse under multi-turn / multi-agent noise. Flag any evidence that externalization trades off interpretability or introduces retrieval brittleness.

(3) **PROPOSE 2 NEW QUESTIONS** assuming the regime has moved: (a) Can sparse autoencoders + persistent latent caches replace episodic token memory entirely for sub-hour horizons? (b) Under what task / compute constraints does parametric fine-tuning re-win over frozen + token/latent?

**Cite arXiv IDs; flag anything you cannot ground in a real paper.**
