INQUIRING LINE

What is the right granularity level for agent memory to enable both reuse and composition?

This explores whether there's a single 'correct' size for the units an agent stores and recombines — and the corpus's blunt answer is that granularity isn't a fixed dial but a variable matched to the task and to a tradeoff between reuse and interference.


The most direct finding in the corpus is that the question has no universal answer: the right granularity depends on what your tasks actually vary over. One line of work argues that memory should be matched to three sources of variance: workflow-level memory wins where tasks are routine and only the arguments change, causal-rule memory wins where the environment drives variance, and fine-grained state-action memory wins where the unpredictability lives in UI state [Does agent memory work better at one level of abstraction?]. So 'reuse vs. composition' isn't a property of memory in the abstract; it's a property of the match between chunk size and where your task's variance comes from.
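To make the matching rule concrete, here is a minimal sketch that maps the dominant source of task variance to a memory granularity. The names `VarianceSource`, `GRANULARITY_FOR`, and `pick_granularity` are hypothetical, and counting observed variance per source is an assumption made for illustration, not a method taken from the cited work.

```python
from enum import Enum


class VarianceSource(Enum):
    """Where task-to-task variation comes from (hypothetical labels)."""
    ARGUMENTS = "arguments"      # routine tasks, only the arguments change
    ENVIRONMENT = "environment"  # the environment's causal structure varies
    UI_STATE = "ui_state"        # fine-grained UI state varies


# Mapping taken from the three cases described in the text.
GRANULARITY_FOR = {
    VarianceSource.ARGUMENTS: "workflow",
    VarianceSource.ENVIRONMENT: "causal_rule",
    VarianceSource.UI_STATE: "state_action",
}


def pick_granularity(variance_counts: dict[VarianceSource, int]) -> str:
    """Return the granularity matched to the most frequent variance source.

    Assumption: `variance_counts` is a non-empty tally of how often each
    source explained a difference between tasks. Ties go to the first key.
    """
    dominant = max(variance_counts, key=variance_counts.get)
    return GRANULARITY_FOR[dominant]
```

How such a tally would be collected in practice is left open here; the point is only that the granularity choice is a function of the task distribution rather than a constant.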

The sharpest tension surfaces in web agents, where coarse, reusable workflow templates actively hurt. Indexing procedures by environment state and local action pairs beats higher-level workflow abstractions, because the high-level summaries strip out the click-by-click specifics you need to actually execute [Does state-indexed memory outperform high-level workflow memory for web agents?]. This is the central reuse/composition bargain made concrete: the more you compress a memory toward reusability, the more you risk discarding the detail that lets it compose into a real action. Generalize too eagerly and you get 'applicability stripping', one of three documented mechanisms (alongside misgrouping and overfitting on narrow streams) by which consolidated textual memory degrades: utility follows an inverted-U and eventually falls *below* that of just keeping raw episodes [Does agent memory degrade when continuously consolidated?].
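A state-indexed procedural memory can be sketched as a lookup from an environment-state key to the local actions that worked there. This is an illustrative toy, not the PRAXIS implementation: the class name, the string state keys, and the exact-match lookup are all assumptions (a real system would more plausibly retrieve by state similarity).

```python
from collections import defaultdict
from typing import Optional


class StateIndexedMemory:
    """Toy memory keyed by environment state, storing local action outcomes."""

    def __init__(self) -> None:
        # state_key -> action -> number of recorded successes
        self._store: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def record(self, state_key: str, action: str, success: bool) -> None:
        """Remember an action taken in a state; only successes are counted."""
        if success:
            self._store[state_key][action] += 1

    def suggest(self, state_key: str) -> Optional[str]:
        """Return the most often successful action for this exact state, if any."""
        actions = self._store.get(state_key)  # .get avoids creating empty entries
        if not actions:
            return None
        return max(actions, key=actions.get)
```

Because each entry keeps the concrete action string, nothing is abstracted away at write time. The cost is reuse: an unseen state key returns `None` rather than a generalized plan.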

Rather than pick one altitude, several systems hold multiple granularities at once. RAISE splits working memory into dialogue-level components (conversation history, scratchpad) and turn-level components (examples, task trajectory), and crucially notes that each level has its own failure modes and update rules [How should agent memory split across time scales?]. AgentFly does something similar by typing memory into case, subtask, and tool modules so credit can be assigned at the right level [Can agents learn continuously from experience without updating weights?], and DeepAgent folds history into separate episodic, working, and tool schemas [Can agents compress their own memory without losing critical details?]. The pattern: don't choose a granularity; maintain a small set of them with clear roles.
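The RAISE-style split can be sketched as one structure whose fields follow different update rules. The four field names come from the text; the specific rules below (dialogue-level fields persist, turn-level fields are reset at the start of each turn) are an assumption chosen for illustration, as are the class and method names.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    # Dialogue-level components: assumed to persist across turns.
    conversation_history: list[str] = field(default_factory=list)
    scratchpad: dict[str, str] = field(default_factory=dict)
    # Turn-level components: assumed to be rebuilt every turn.
    examples: list[str] = field(default_factory=list)
    task_trajectory: list[str] = field(default_factory=list)

    def start_turn(self, user_message: str, retrieved_examples: list[str]) -> None:
        """Append to dialogue-level state and reset turn-level state."""
        self.conversation_history.append(user_message)
        self.examples = list(retrieved_examples)
        self.task_trajectory = []

    def log_step(self, step: str) -> None:
        """Record one action or observation within the current turn."""
        self.task_trajectory.append(step)
```

Keeping the two levels in separate fields makes it possible to give each its own update and eviction policy instead of applying one rule to everything.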

The most interesting reframe is that granularity may not need to be fixed in advance at all. FluxMem lets the memory's link topology form, refine, and prune itself from execution feedback, reaching state-of-the-art results precisely by *aligning abstraction dynamically* and eliminating interference between chunks [Should agent memory adapt dynamically based on execution feedback?]. That turns 'what's the right granularity' from a design-time decision into a learned, closed-loop one: the agent discovers the level at which its experiences reuse cleanly.
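As a rough picture of feedback-driven topology, the sketch below strengthens a link between two memory items when following it led to success, weakens it on failure, and drops it once it falls under a threshold. This is not FluxMem's algorithm: the class name, the additive update, and the numeric defaults are all assumptions.

```python
class LinkedMemory:
    """Toy link graph whose edges are reinforced or pruned by outcomes."""

    def __init__(self, initial: float = 0.5, step: float = 0.25,
                 prune_below: float = 0.2) -> None:
        self.weights: dict[tuple[str, str], float] = {}  # (src, dst) -> weight
        self.initial = initial
        self.step = step
        self.prune_below = prune_below

    def feedback(self, src: str, dst: str, success: bool) -> None:
        """Form or update the src -> dst link from one execution outcome."""
        key = (src, dst)
        weight = self.weights.get(key, self.initial)
        weight = min(1.0, weight + self.step) if success else weight - self.step
        if weight < self.prune_below:
            self.weights.pop(key, None)  # prune links that keep failing
        else:
            self.weights[key] = weight

    def neighbors(self, src: str) -> list[str]:
        """Linked items reachable from src, strongest link first."""
        linked = [dst for (s, dst) in self.weights if s == src]
        return sorted(linked, key=lambda dst: -self.weights[(src, dst)])
```

With these defaults a new link survives one failure and is pruned after a second consecutive one, so which items stay connected is decided by execution outcomes rather than at design time.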

If you came here for a number, you leave with a principle instead: the corpus insists the real bottleneck isn't chunk size or storage at all, but curation, meaning what to discard and how to avoid staleness, drift, contamination, and over-generalization [Is agent memory capacity or quality the real bottleneck?]. The right granularity is the one that survives that quality test. Pick a level that's specific enough to execute and general enough to recur, and back it with a pruning policy, because adding un-curated memory makes performance worse no matter how cleverly you sized the chunks.
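One way to picture a pruning policy is a curation pass that drops entries that are stale or that have proven unreliable. The sketch below is an assumption-laden illustration: the `MemoryEntry` fields, the `curate` function, and every threshold are hypothetical and not drawn from the cited notes.

```python
from dataclasses import dataclass


@dataclass
class MemoryEntry:
    key: str
    last_used_step: int  # step counter value when the entry was last retrieved
    uses: int            # how many times it was applied
    successes: int       # how many of those applications succeeded


def curate(entries: list[MemoryEntry], now_step: int, max_age: int = 100,
           min_success_rate: float = 0.5, min_uses: int = 3) -> list[MemoryEntry]:
    """Keep entries that are neither stale nor demonstrably unreliable."""
    kept = []
    for entry in entries:
        stale = now_step - entry.last_used_step > max_age
        # Only judge reliability once there is enough evidence.
        unreliable = (entry.uses >= min_uses
                      and entry.successes / entry.uses < min_success_rate)
        if not (stale or unreliable):
            kept.append(entry)
    return kept
```

The `min_uses` guard keeps a new entry from being discarded after a single unlucky application, which is one simple hedge against pruning too eagerly.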


Sources (8 notes)

Does agent memory work better at one level of abstraction?

Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.

Does state-indexed memory outperform high-level workflow memory for web agents?

PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized at two levels: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Should agent memory adapt dynamically based on execution feedback?

FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.

Is agent memory capacity or quality the real bottleneck?

The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher auditing whether agent memory granularity still matters—or whether newer architectures, training methods, or inference harnesses have shifted the constraint. The question: what's the right chunk size for memory to enable both reuse and composition in agentic systems?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable.
• Granularity is task-conditional: workflow-level memory wins for routine tasks, causal-rule memory for environment-driven variance, fine-grained state-action for UI state variance (2024–2025).
• Coarse, reusable workflow templates actively hurt web agents; state-dependent procedural memory beats high-level abstraction because compression strips click-by-click detail needed for execution (2024–2025).
• Consolidated textual memory follows an inverted-U utility curve: generalization beyond a threshold degrades performance (2025).
• Best-performing systems hold multiple granularities simultaneously (dialogue-level, turn-level, case/subtask/tool modules) with clear roles and update rules (2025–2026).
• Memory link topology can form and prune itself from execution feedback; dynamic alignment of abstraction outperforms fixed design-time granularity (2026).

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 (2024-09): Agent Workflow Memory
• arXiv:2508.16153 (2025-08): AgentFly—fine-tuning agents without fine-tuning LLMs
• arXiv:2605.28773 (2026-05): Rethinking Memory as Continuously Evolving Connectivity
• arXiv:2605.12978 (2026-05): Useful Memories Become Faulty When Continuously Updated

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above—especially the inverted-U utility curve and the claim that coarse memory hurts—judge whether context window expansion, mixture-of-experts routing, vector retrieval refinement, or new update/decay schedules have relaxed the tradeoff. Has the bottleneck shifted from granularity to curation? Cite what resolved or still constrains each claim.
(2) Surface the strongest work from the last ~6 months showing either multi-granularity synthesis beating single-level memory, or conversely, unified memory representations outperforming modular stacks. Flag disagreements in the corpus.
(3) Propose 2 research questions assuming the regime may have moved: (a) Does continuous link pruning now dominate design-time granularity choice across scale regimes? (b) Under what conditions does memory staleness (arXiv:2605.12978) overwhelm the benefits of multi-level composition?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
