What is the right granularity level for agent memory to enable both reuse and composition?
This explores whether there's a single 'correct' size for the units an agent stores and recombines — and the corpus's blunt answer is that granularity isn't a fixed dial but a variable matched to the task and to a tradeoff between reuse and interference.
The most direct finding in the corpus is that the question has no universal answer: the right granularity depends on what your tasks actually vary over. One line of work argues that memory should be matched to one of three sources of variance: workflow-level memory wins where tasks are routine and only the arguments change, causal-rule memory wins where the environment's causal structure drives variance, and fine-grained state-action memory wins where the unpredictability lives in UI state (Does agent memory work better at one level of abstraction?). So 'reuse vs. composition' isn't a property of memory in the abstract; it's a property of the match between chunk size and where your task's variance comes from.
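The matching rule above can be written down as a small lookup. This is only an illustrative sketch: the enum, the string labels, and `choose_granularity` are hypothetical names introduced here, not an API from any of the cited systems.

```python
from enum import Enum


class VarianceSource(Enum):
    """Where a task family's variance comes from (hypothetical labels)."""
    ARGUMENTS = "arguments"      # routine tasks; only the arguments change
    ENVIRONMENT = "environment"  # the environment's causal structure drives variance
    UI_STATE = "ui_state"        # unpredictability lives in fine-grained UI state


# Assumed mapping, taken directly from the three cases described in the text.
GRANULARITY_FOR = {
    VarianceSource.ARGUMENTS: "workflow",
    VarianceSource.ENVIRONMENT: "causal_rule",
    VarianceSource.UI_STATE: "state_action",
}


def choose_granularity(source: VarianceSource) -> str:
    """Return the memory granularity matched to the source of task variance."""
    return GRANULARITY_FOR[source]


print(choose_granularity(VarianceSource.UI_STATE))
```

In practice the hard part is diagnosing the variance source, which this sketch takes as given.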
The sharpest tension surfaces in web agents, where coarse, reusable workflow templates underperform. Indexing procedures by environment state and local action pairs beats higher-level workflow abstractions, because the high-level summaries strip out the click-by-click specifics you need to actually execute (Does state-indexed memory outperform high-level workflow memory for web agents?). This is the central reuse/composition bargain made concrete: the more you compress a memory toward reusability, the more you risk discarding the detail that lets it compose into a real action. Generalize too eagerly and you get 'applicability stripping', one of the documented mechanisms by which consolidated textual memory degrades: performance follows an inverted U as experience accumulates and eventually ends up *worse* than just keeping raw episodes (Does agent memory degrade when continuously consolidated?).
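A minimal sketch of what state-indexed memory means, as opposed to a single workflow summary. This is not the PRAXIS implementation. `StateIndexedMemory`, the string state keys, and the action strings are all assumptions for illustration, and a real system would more likely match states by similarity than by exact key.

```python
from collections import defaultdict


class StateIndexedMemory:
    """Stores local actions keyed by the environment state they were taken in."""

    def __init__(self) -> None:
        self._actions_by_state: dict[str, list[str]] = defaultdict(list)

    def record(self, state_key: str, action: str) -> None:
        """Remember that `action` was taken in the state identified by `state_key`."""
        self._actions_by_state[state_key].append(action)

    def recall(self, state_key: str) -> list[str]:
        """Return the actions seen in this state; empty if the state is unseen."""
        # .get avoids creating empty entries for unseen states.
        return list(self._actions_by_state.get(state_key, []))


memory = StateIndexedMemory()
memory.record("login_form:empty", "type('#username', 'alice')")
memory.record("login_form:filled", "click('#submit')")
print(memory.recall("login_form:filled"))
```

A workflow-level entry such as "log in, then search" would be reusable across more sites, but it carries none of the selectors that `recall` returns here.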
Rather than pick one altitude, several systems hold multiple granularities at once. RAISE splits working memory into dialogue-level components (conversation history, scratchpad) and turn-level components (examples, task trajectory), and crucially notes each granularity has its own failure modes and update rules (How should agent memory split across time scales?). AgentFly does something similar by typing memory into case, subtask, and tool modules so credit can be assigned at the right level (Can agents learn continuously from experience without updating weights?), and DeepAgent folds history into separate episodic, working, and tool schemas (Can agents compress their own memory without losing critical details?). The pattern: don't choose a granularity, maintain a small set of them with clear roles.
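One way to picture "a small set of granularities with clear roles" is a working-memory record whose fields follow the dialogue-level/turn-level split described for RAISE. The update rules below (dialogue-level fields persist, turn-level fields are rebuilt each turn) are an assumption made for this sketch, and `WorkingMemory`, `start_turn`, and `log_step` are hypothetical names rather than RAISE's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    # Dialogue-level components: persist across turns.
    conversation_history: list[str] = field(default_factory=list)
    scratchpad: dict[str, str] = field(default_factory=dict)
    # Turn-level components: rebuilt at the start of each turn (assumed rule).
    examples: list[str] = field(default_factory=list)
    task_trajectory: list[str] = field(default_factory=list)

    def start_turn(self, user_message: str, retrieved_examples: list[str]) -> None:
        """Append to dialogue-level state and reset turn-level state."""
        self.conversation_history.append(user_message)
        self.examples = list(retrieved_examples)
        self.task_trajectory = []

    def log_step(self, step: str) -> None:
        """Record one step of the current turn's trajectory."""
        self.task_trajectory.append(step)


wm = WorkingMemory()
wm.start_turn("Book a flight to Oslo", ["example: booking a train"])
wm.log_step("search_flights(dest='OSL')")
wm.start_turn("Make it a window seat", ["example: seat selection"])
print(len(wm.conversation_history), wm.task_trajectory)
```

Giving each level its own update rule is the point: a bug in turn-level reset and a bug in dialogue-level accumulation show up as different failures.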
The most interesting reframe is that maybe granularity shouldn't be fixed in advance at all. FluxMem lets the memory's link topology form, refine, and consolidate itself from execution feedback, reaching state-of-the-art results across three benchmarks precisely by *aligning abstraction dynamically* and eliminating interference between chunks (Should agent memory adapt dynamically based on execution feedback?). That turns 'what's the right granularity' from a design-time decision into a learned, closed-loop one: the agent discovers the level at which its experiences reuse cleanly.
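To make the closed-loop idea concrete, here is a toy sketch in which links between memory items are strengthened when using them together succeeds and dropped when they repeatedly fail. It is not FluxMem's algorithm. The class name, the additive update, and every threshold are assumptions chosen only to keep the example small.

```python
class LinkedMemory:
    """Undirected links between memory items, reweighted by execution feedback."""

    def __init__(self, initial: float = 0.5, step: float = 0.25,
                 prune_below: float = 0.2) -> None:
        self.initial = initial          # weight given to a newly observed link
        self.step = step                # how much one outcome moves the weight
        self.prune_below = prune_below  # links weaker than this are removed
        self.weights: dict[tuple[str, str], float] = {}

    def feedback(self, a: str, b: str, success: bool) -> None:
        """Update the link between a and b after they were used together."""
        key = (a, b) if a <= b else (b, a)
        weight = self.weights.get(key, self.initial)
        weight += self.step if success else -self.step
        weight = max(0.0, min(1.0, weight))
        if weight < self.prune_below:
            self.weights.pop(key, None)
        else:
            self.weights[key] = weight

    def neighbors(self, item: str) -> list[str]:
        """Return the items still linked to `item`, sorted by name."""
        linked = [b if a == item else a
                  for (a, b) in self.weights if item in (a, b)]
        return sorted(linked)


memory = LinkedMemory()
memory.feedback("fill_form", "click_submit", success=True)
memory.feedback("fill_form", "open_menu", success=False)
memory.feedback("fill_form", "open_menu", success=False)
print(memory.neighbors("fill_form"))
```

The effective granularity is whatever cluster of linked items survives feedback, rather than a size chosen up front.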
If you came here for a number, you leave with a principle instead: the corpus insists the real bottleneck isn't chunk size or storage at all, but curation, meaning what to discard and how to avoid staleness, drift, and over-generalization (Is agent memory capacity or quality the real bottleneck?). The right granularity is the one that survives that quality test. Pick a level that's specific enough to execute, general enough to recur, and back it with a pruning policy, because an un-curated memory makes performance worse no matter how cleverly you sized the chunks.
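A pruning policy can be as simple as dropping entries that are stale or that have stopped working. The sketch below is one possible policy under stated assumptions: `MemoryItem`, `curate`, and the `max_age` and `min_success_rate` defaults are hypothetical and would need tuning for any real agent.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    uses: int            # how many times the item was retrieved and applied
    successes: int       # how many of those uses led to a successful outcome
    last_used_step: int  # agent step at which the item was last used


def curate(items: list[MemoryItem], current_step: int,
           max_age: int = 100, min_success_rate: float = 0.5) -> list[MemoryItem]:
    """Keep items that are neither stale nor unreliable."""
    kept = []
    for item in items:
        is_stale = current_step - item.last_used_step > max_age
        # Unused items have no track record yet, so they are judged on age alone.
        is_unreliable = (item.uses > 0
                         and item.successes / item.uses < min_success_rate)
        if not (is_stale or is_unreliable):
            kept.append(item)
    return kept


items = [
    MemoryItem("submit button is #submit on the login page", 4, 3, 190),
    MemoryItem("always click the first link", 4, 1, 195),
    MemoryItem("old checkout flow", 2, 2, 50),
]
print([item.text for item in curate(items, current_step=200)])
```

This covers staleness and unreliability only. Detecting drift or over-generalization would need signals beyond these counters.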
Sources (8 notes)
Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.
PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.
LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.
RAISE shows that agent working memory consists of four components split across two levels: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.
The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.