Does workflow-level memory or state-action memory better capture reusable agent knowledge?
This explores a real debate in the corpus: should an agent remember reusable knowledge as high-level task workflows (abstracted sub-task routines) or as fine-grained state-action pairs tied to specific situations? The honest answer is that neither wins universally; it depends on where a task's difficulty actually lives.
This explores whether agents better capture reusable knowledge by remembering high-level workflows (abstracted sub-task routines) or by remembering state-action pairs (what to do in each specific situation). The most useful finding in the corpus is that the question has no single winner. The cleanest framing comes from work showing that memory granularity is domain-conditional (see "Does agent memory work better at one level of abstraction?"): workflow-level memory wins in routine-rich domains where the same procedure repeats with different arguments, causal-rule memory wins in environment-rich domains, and state-action memory wins in spatially-rich web tasks where success hinges on click-by-click UI detail. So the right unit of memory tracks where a task's variance comes from: arguments, causal structure, or fine-grained interface state.
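The domain-conditional mapping above can be written down as a small lookup. This is only an illustrative sketch: the names `VarianceSource`, `Granularity`, and `pick_granularity` are hypothetical and do not come from any of the cited systems, and it assumes the source of task variance is already known, which in practice is the hard part.

```python
from enum import Enum


class VarianceSource(Enum):
    """Where a task's variation comes from (hypothetical labels)."""
    ARGUMENTS = "arguments"          # same procedure, different inputs
    CAUSAL_STRUCTURE = "causal"      # environment dynamics drive success
    INTERFACE_STATE = "interface"    # click-by-click UI detail drives success


class Granularity(Enum):
    """The unit of memory to consult."""
    WORKFLOW = "workflow"
    CAUSAL_RULE = "causal_rule"
    STATE_ACTION = "state_action"


# Direct encoding of the mapping stated in the text.
GRANULARITY_FOR = {
    VarianceSource.ARGUMENTS: Granularity.WORKFLOW,
    VarianceSource.CAUSAL_STRUCTURE: Granularity.CAUSAL_RULE,
    VarianceSource.INTERFACE_STATE: Granularity.STATE_ACTION,
}


def pick_granularity(source: VarianceSource) -> Granularity:
    """Return the memory granularity suggested for a variance source."""
    return GRANULARITY_FOR[source]
```

A static table like this is the simplest possible router; it says nothing about how an agent would detect the variance source for a new task.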
You can see both sides argued directly. The case for workflows: Agent Workflow Memory induces reusable sub-task routines at finer granularity than whole tasks, strips out example-specific values, and compounds them hierarchically, yielding relative gains of 24.6% on Mind2Web and 51.1% on WebArena that grow as the gap between training and test situations widens (see "Can agents learn reusable sub-task routines from past experience?"). That's exactly what you'd want when the skill is portable and only the details change. The case against, for web agents specifically: PRAXIS shows that indexing procedures by environment state and local action pairs beats high-level workflow abstractions on the REAL benchmark, because workflow summaries discard the click-by-click specifics that actually determine whether a web action succeeds (see "Does state-indexed memory outperform high-level workflow memory for web agents?"). These two aren't contradictory; they're two points on the same axis, and the domain decides which one pays off.
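To make the contrast concrete, here is a minimal sketch of the two storage styles. It is not the implementation of Agent Workflow Memory or PRAXIS: `abstract_workflow` and `StateActionMemory` are hypothetical names, the string replacement is a crude stand-in for workflow induction, and exact-match lookup on a state key is an assumed simplification of state-indexed retrieval.

```python
def abstract_workflow(steps, example_values):
    """Turn a concrete trajectory into a reusable routine by replacing
    example-specific values with named placeholders.

    `example_values` maps a placeholder name to the literal value seen
    in this one example. Naive substring replacement is assumed to be
    good enough for illustration.
    """
    abstracted = []
    for step in steps:
        for name, value in example_values.items():
            step = step.replace(value, "{" + name + "}")
        abstracted.append(step)
    return abstracted


class StateActionMemory:
    """Store actions under the exact state in which they worked."""

    def __init__(self):
        self._store = {}  # hashable state key -> list of actions

    def remember(self, state, action):
        self._store.setdefault(state, []).append(action)

    def recall(self, state):
        # Returns a copy so callers cannot mutate the stored list.
        return list(self._store.get(state, []))


# Workflow memory keeps the procedure and drops the specifics.
routine = abstract_workflow(
    ["type 'Boston' into #destination", "click #search"],
    {"city": "Boston"},
)

# State-action memory keeps the specifics and drops the procedure.
memory = StateActionMemory()
memory.remember(("checkout", "promo-banner-open"), "click #close-banner")
```

The workflow routine transfers to any city but knows nothing about the promo banner; the state-action entry handles the banner but only on a page that matches its key.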
The workflow-vs-state-action choice is a false binary, because nothing forces an agent to pick only one. Several systems hold multiple memory types at once. AgentFly formalizes agent learning as a memory-augmented decision process with three modules (case, sub-task, and tool memory) and improves its policy entirely through memory operations without touching model weights, hitting 87.88% on GAIA validation (see "Can agents learn continuously from experience without updating weights?"). RAISE splits working memory into four components across two time scales (dialogue-level vs. turn-level), arguing each component has its own failure mode and update rule (see "How should agent memory split across time scales?"). And DeepAgent's autonomous memory folding consolidates history into episodic, working, and tool schemas so the agent can compress without losing what matters (see "Can agents compress their own memory without losing critical details?"). The pattern: capable agents don't choose a granularity; they stratify several and route to the right one.
A broader framing recasts the whole question. One line of work argues agent reliability comes less from model scale than from externalizing cognitive burdens (memory, skills, and protocols) into a harness layer, so the model stops re-solving the same problems (see "Where does agent reliability actually come from?"). Seen this way, "workflow vs. state-action" is one design decision inside the larger move of externalizing procedural knowledge. VOYAGER makes the same point from the skill side: storing executable skills in an embedding-indexed library and composing complex skills from simpler ones lets agents learn for a lifetime without the catastrophic forgetting that weight updates cause (see "Can agents learn new skills without forgetting old ones?"). Whether you call the stored unit a workflow, a skill, or a state-action procedure, the reusability comes from putting it in an external, composable store rather than baking it into parameters.
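The idea of an external, composable skill store can be sketched in a few lines. This is a toy under stated assumptions, not VOYAGER's implementation: `SkillLibrary` is a hypothetical class, and the bag-of-words `embed` function stands in for a real learned text embedding so that the example runs with only the standard library.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)  # Counter returns 0 for missing words
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SkillLibrary:
    def __init__(self):
        self._skills = {}  # name -> (description embedding, callable)

    def add(self, name, description, fn):
        self._skills[name] = (embed(description), fn)

    def retrieve(self, query):
        """Return (name, callable) of the skill closest to the query."""
        q = embed(query)
        name = max(self._skills, key=lambda n: cosine(q, self._skills[n][0]))
        return name, self._skills[name][1]

    def compose(self, name, description, part_names):
        """Register a new skill that runs existing skills in sequence."""
        parts = [self._skills[p][1] for p in part_names]

        def composed(state):
            for part in parts:
                state = part(state)
            return state

        self.add(name, description, composed)


lib = SkillLibrary()
lib.add("mine_wood", "chop a tree to collect wood", lambda inv: inv + ["wood"])
lib.add("craft_planks", "turn wood into planks", lambda inv: inv + ["planks"])
lib.compose("make_planks", "collect wood then craft planks",
            ["mine_wood", "craft_planks"])
```

Adding `make_planks` leaves the two simpler skills untouched, which is the property the text attributes to external stores: new knowledge is appended rather than overwriting old knowledge.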
If you want the one-line takeaway: workflow-level memory captures reusable knowledge best when tasks are procedurally repetitive and only arguments vary; state-action memory captures it best when success depends on fine-grained, situation-specific interface detail; and the strongest systems carry both (plus causal and tool memory) and let the domain decide which to consult. The interesting open question the corpus points at is not which to choose, but how an agent should learn to route to the right granularity on its own.
Sources (8 notes)
Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
RAISE shows that agent memory consists of four components organized across two time scales: dialogue-level (conversation history, scratchpad) versus turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.
DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.
Research shows reliable LLM agents externalize three cognitive burdens into a harness layer rather than relying on model scale alone: memory (state persistence), skills (procedural components), and protocols (structured interaction). The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.