INQUIRING LINE

Does workflow-level memory or state-action memory better capture reusable agent knowledge?

This explores a real debate in the corpus: should an agent remember reusable knowledge as high-level task workflows (abstracted sub-task routines) or as fine-grained state-action pairs tied to specific situations? The honest answer is that neither wins universally; it depends on where a task's difficulty actually lives.


The most useful finding in the corpus is that the question has no single winner. The cleanest framing comes from work showing that memory granularity is domain-conditional (see "Does agent memory work better at one level of abstraction?"): workflow-level memory wins in routine-rich domains where the same procedure repeats with different arguments, causal-rule memory wins in environment-rich domains, and state-action memory wins in spatially-rich web tasks where success hinges on click-by-click UI detail. So the right unit of memory tracks where a task's variance comes from: arguments, causal structure, or fine-grained interface state.

You can see both sides argued directly. The case for workflows: Agent Workflow Memory induces reusable sub-task routines at finer granularity than whole tasks, strips out example-specific values, and compounds them hierarchically, yielding relative gains of 24.6% on Mind2Web and 51.1% on WebArena, with larger gains as the gap between training and test situations widens (see "Can agents learn reusable sub-task routines from past experience?"). That's exactly what you'd want when the skill is portable and only the details change. The case against, for web agents specifically: PRAXIS shows that indexing procedures by environment state and local action pairs beats high-level workflow abstractions, because workflow summaries discard the click-by-click specifics that actually determine whether a web action succeeds (see "Does state-indexed memory outperform high-level workflow memory for web agents?"). These two aren't contradictory; they're two points on the same axis, and the domain decides which one pays off.
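To make the contrast concrete, here is a minimal Python sketch of the two granularities. It is a toy illustration, not the Agent Workflow Memory or PRAXIS implementation: `Step`, `induce_workflow`, `instantiate`, and `StateActionMemory` are hypothetical names, and the state signature is assumed to be a simple hashable tuple.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    action: str       # e.g. "type" or "click"
    target: str       # description of the UI element
    value: str = ""   # concrete argument, if the action takes one

def induce_workflow(trajectory, example_values):
    """Swap example-specific values for named placeholders (hypothetical helper)."""
    slots = {value: "{" + name + "}" for name, value in example_values.items()}
    return [Step(s.action, s.target, slots.get(s.value, s.value)) for s in trajectory]

def instantiate(workflow, **args):
    """Fill the placeholders of a stored workflow with new arguments."""
    return [Step(s.action, s.target, s.value.format(**args)) for s in workflow]

class StateActionMemory:
    """Maps an observed state signature to the local action that worked there."""
    def __init__(self):
        self._table = {}

    def record(self, state_signature, step):
        self._table[state_signature] = step

    def recall(self, state_signature):
        return self._table.get(state_signature)  # None if the state is unseen

# Workflow memory: one abstracted routine, replayed with a different argument.
trajectory = [Step("type", "search box", "Paris"), Step("click", "search button")]
workflow = induce_workflow(trajectory, {"city": "Paris"})
replayed = instantiate(workflow, city="Oslo")

# State-action memory: a local fix tied to one specific interface state.
memory = StateActionMemory()
memory.record(("results page", "cookie banner visible"), Step("click", "accept cookies"))
hit = memory.recall(("results page", "cookie banner visible"))
miss = memory.recall(("results page", "no banner"))
```

The workflow generalizes across arguments but says nothing about the cookie banner; the state-indexed entry handles the banner but only when that exact state recurs. That trade-off is the axis the two papers sit on.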

The workflow-vs-state-action choice is a false binary if you only ever pick one, and several systems hold multiple memory types at once. AgentFly formalizes agent learning as a memory-augmented decision process with three modules (case, sub-task, and tool memory) and improves its policy entirely through memory operations without touching model weights, reaching 87.88% on GAIA validation (see "Can agents learn continuously from experience without updating weights?"). RAISE splits working memory into four components across two time scales (dialogue-level vs. turn-level), arguing each component has its own failure mode and update rule (see "How should agent memory split across time scales?"). And DeepAgent's autonomous memory folding consolidates history into episodic, working, and tool schemas so the agent can compress without losing what matters (see "Can agents compress their own memory without losing critical details?"). The pattern: capable agents don't choose a granularity, they stratify several and route to the right one.
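The "stratify and route" pattern can be sketched in a few lines of Python. This is an illustrative assumption, not the design of AgentFly, RAISE, or DeepAgent: the `Variance` enum, the `ROUTE` table, and the `StratifiedMemory` class are hypothetical, and the routing rule simply hard-codes the domain-conditional finding described earlier.

```python
from enum import Enum

class Variance(Enum):
    """Where a task's variance comes from (hypothetical labels)."""
    ARGUMENTS = "arguments"
    CAUSAL = "causal structure"
    INTERFACE = "fine-grained interface state"

# Assumed routing table: variance source -> memory granularity to consult.
ROUTE = {
    Variance.ARGUMENTS: "workflow",
    Variance.CAUSAL: "causal_rule",
    Variance.INTERFACE: "state_action",
}

class StratifiedMemory:
    """Holds several memory stores side by side and routes reads among them."""
    def __init__(self):
        self.stores = {name: {} for name in ROUTE.values()}

    def write(self, store, key, value):
        self.stores[store][key] = value

    def consult(self, variance, key):
        # Only memory is read or written here; no model weights are involved.
        return self.stores[ROUTE[variance]].get(key)

memory = StratifiedMemory()
memory.write("workflow", "book flight", ["open site", "enter {city}", "submit"])
memory.write("state_action", "login page with captcha", "click audio challenge")

routine = memory.consult(Variance.ARGUMENTS, "book flight")
local_fix = memory.consult(Variance.INTERFACE, "login page with captcha")
unknown = memory.consult(Variance.CAUSAL, "book flight")
```

Here the variance label is supplied by the caller. Having the agent infer that label on its own is the harder problem, and this sketch does not attempt it.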

A deeper framing recasts the whole question. One line of work argues agent reliability comes less from model scale than from externalizing cognitive burdens (memory, skills, and protocols) into a harness layer, so the model stops re-solving the same problems (see "Where does agent reliability actually come from?"). Seen this way, "workflow vs. state-action" is one design decision inside the larger move of externalizing procedural knowledge. VOYAGER makes the same point from the skill side: storing executable skills in an embedding-indexed library and composing complex skills from simpler ones lets agents learn continuously without the forgetting that weight updates cause (see "Can agents learn new skills without forgetting old ones?"). Whether you call the stored unit a workflow, a skill, or a state-action procedure, the reusability comes from putting it in an external, composable store rather than baking it into parameters.
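As a rough illustration of an external, composable skill store, the Python sketch below indexes callables by a description embedding and builds a complex skill from simpler ones. It is not VOYAGER's code: `embed` is a toy bag-of-words stand-in for a real embedding model, and `SkillLibrary` and `compose` are hypothetical names.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: bag-of-words counts (assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SkillLibrary:
    """External store of executable skills, retrieved by description similarity."""
    def __init__(self):
        self._skills = []  # entries are (description, embedding, callable)

    def add(self, description, fn):
        self._skills.append((description, embed(description), fn))

    def retrieve(self, query):
        q = embed(query)
        return max(self._skills, key=lambda entry: cosine(q, entry[1]))[2]

def compose(*skills):
    """Build a complex skill by running simpler skills in sequence."""
    def composed(state):
        for skill in skills:
            state = skill(state)
        return state
    return composed

library = SkillLibrary()
library.add("mine wood from a tree", lambda inventory: inventory + ["wood"])
library.add("craft planks from wood", lambda inventory: inventory + ["planks"])

# A new skill is assembled from stored ones and written back to the library.
library.add(
    "build a crafting table",
    compose(library.retrieve("mine wood"), library.retrieve("craft planks")),
)
result = library.retrieve("build a table")([])
```

Adding the composite skill never modifies the two simpler ones, which is the sense in which an external store sidesteps forgetting: old entries stay retrievable while new ones accumulate.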

If you want the one-line takeaway: workflow-level memory captures reusable knowledge best when tasks are procedurally repetitive and only arguments vary; state-action memory captures it best when success depends on fine-grained, situation-specific interface detail; and the strongest systems carry both (plus causal and tool memory) and let the domain decide which to consult. The interesting open question the corpus points at is not which to choose, but how an agent should learn to route to the right granularity on its own.


Sources (8 notes)

Does agent memory work better at one level of abstraction?

Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Does state-indexed memory outperform high-level workflow memory for web agents?

PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized by two design axes: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an agent-memory researcher re-testing claims about memory granularity in LLM agents. The core question remains: does workflow-level or state-action memory better capture reusable agent knowledge — or is the answer domain-conditional?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026. A curated library reported:
• Memory granularity is domain-conditional: workflows win in routine-rich domains (argument variance), state-action wins in spatially-rich web tasks (UI detail), causal rules win in environment-rich domains (~2024–2025).
• Workflow-level memory yields 24–51% gains on routine-heavy tasks as train–test distance widens, by stripping example-specific values and compounding hierarchically (~2024).
• The strongest systems (AgentFly, RAISE, DeepAgent) stratify multiple memory types at once — case, sub-task, tool, episodic, working — rather than choosing one, hitting 87.88% on GAIA (~2025).
• Agent reliability comes from externalizing cognitive burdens (memory, skills, protocols) into a harness layer, not from scaling model weights alone (~2026).
• Long-horizon reasoning now uses "subconscious threads" (compressed latent memory) to bypass context limits; procedural learning can now happen in real-time from experience (~2025–2026).

Anchor papers (verify; mind their dates):
• 2409.07429 Agent Workflow Memory
• 2510.21618 DeepAgent: A General Reasoning Agent with Scalable Toolsets
• 2604.08224 Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness
• 2512.13564 Memory in the Age of AI Agents: A Survey — Forms, Functions and Dynamics

Your task:
(1) RE-TEST THE DOMAIN-CONDITIONAL CLAIM. For each domain (routine-rich, environment-rich, spatially-rich), check whether recent models (o1, Claude 3.5, newer reasoning agents) or new memory architectures (subconscious threads, federation fabrics, skill evolution) have relaxed the constraints. Separate the durable insight (different tasks need different granularities) from what may be obsolete (which granularity wins where). Does real-time procedural learning or collective skill evolution change the calculus?

(2) Surface contradicting or superseding work from the last 6 months. Specifically: does SkillClaw (2604.08377), Useful Memories Become Faulty (2605.26112), or Federation of Agents (2509.20175) suggest the "stratify multiple types" pattern has limits or failure modes?

(3) Propose 2 research questions that assume the regime may have moved:
   – How should an agent learn to *route* among workflow, state-action, and causal memory without human annotation of domain structure?
   – Does externalized memory (harness-layer skills) now make the granularity question moot — i.e., do agents reuse knowledge better by composing *executable* skill libraries than by routing memory queries?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
