Can agents learn reusable sub-task routines from past experience?
Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
Agent Workflow Memory (AWM) takes the human heuristic of abstracting routines from past experience and operationalizes it for web agents. The diagnostic claim is that current agents fail at long-horizon tasks not because they lack reasoning but because they cannot extract and reuse sub-task workflows shared across similar tasks — they solve each task in isolation and never accumulate transferable skill structure.
AWM's intervention has two design choices that matter. First, granularity is below the task level: rather than memorizing "Buy dry cat food on Amazon and deliver to my address," the system induces "search for a product on Amazon" — a sub-task that re-appears across many top-level tasks. Second, example-specific contexts are abstracted out — "dry cat food" becomes "{product-name}" — so the workflow is reusable rather than overfit to its source trace.
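The second design choice, abstracting example-specific values into variables, can be illustrated with a minimal sketch. This is not AWM's induction procedure: plain string substitution stands in for whatever induces the workflow. The `Workflow` class, the `abstract_trace` and `instantiate` helpers, and the action strings are all hypothetical names chosen for illustration. The placeholder is spelled `{product_name}` rather than `{product-name}` so that it is a valid Python format field.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Workflow:
    """A reusable sub-task routine with {placeholders} for variables (hypothetical type)."""
    name: str
    steps: Tuple[str, ...]


def abstract_trace(name: str, steps: List[str], bindings: Dict[str, str]) -> Workflow:
    """Replace each example-specific value in the trace with a {variable} placeholder.

    `bindings` maps a variable name to the concrete value observed in the source trace.
    """
    abstracted = []
    for step in steps:
        for variable, value in bindings.items():
            step = step.replace(value, "{" + variable + "}")
        abstracted.append(step)
    return Workflow(name, tuple(abstracted))


def instantiate(workflow: Workflow, **values: str) -> List[str]:
    """Fill the placeholders with values from a new task."""
    return [step.format(**values) for step in workflow.steps]


# A concrete sub-task trace taken from a larger "buy dry cat food" task (made-up actions).
trace = [
    'click("search box")',
    'type("search box", "dry cat food")',
    'click("search button")',
]

workflow = abstract_trace(
    "search for a product on Amazon", trace, {"product_name": "dry cat food"}
)
print(workflow.steps[1])
print(instantiate(workflow, product_name="usb-c cable")[1])
```

The stored routine no longer mentions cat food, so the same three steps serve any task that needs a product search.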
The compounding effect is the key behavior. Once "find a place by its name" exists, it serves as a building block for "get the zip code of a place." Skill memory therefore grows hierarchically: complex workflows are constructed on top of previously acquired ones. Empirically, this produces relative success-rate gains of 24.6% on Mind2Web and 51.1% on WebArena, with a 22.5-point gap on WebArena after only tens of examples. Critically, online AWM's advantage over baselines widens as the train-test gap grows (from 8.9 to 14.0 absolute points), because workflow abstractions transfer where memorized trajectories do not.
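The hierarchical growth described above can be sketched as a small library in which a workflow step may call a previously stored workflow. This is an assumption-laden illustration, not AWM's data structure: the `WorkflowLibrary` class, the `call:` step prefix, and the action strings are all hypothetical.

```python
from typing import Dict, List

CALL_PREFIX = "call:"  # hypothetical marker: the step refers to a stored workflow


class WorkflowLibrary:
    """Stores workflows; later workflows may build on earlier ones."""

    def __init__(self) -> None:
        self._workflows: Dict[str, List[str]] = {}

    def add(self, name: str, steps: List[str]) -> None:
        # Refusing redefinition, and requiring callees to exist already,
        # keeps the call graph acyclic, so expand() always terminates.
        if name in self._workflows:
            raise ValueError("workflow already defined: " + name)
        for step in steps:
            if step.startswith(CALL_PREFIX):
                callee = step[len(CALL_PREFIX):]
                if callee not in self._workflows:
                    raise KeyError("unknown workflow: " + callee)
        self._workflows[name] = list(steps)

    def expand(self, name: str) -> List[str]:
        """Flatten a workflow into primitive actions, inlining any calls."""
        flat: List[str] = []
        for step in self._workflows[name]:
            if step.startswith(CALL_PREFIX):
                flat.extend(self.expand(step[len(CALL_PREFIX):]))
            else:
                flat.append(step)
        return flat


library = WorkflowLibrary()
library.add(
    "find a place by its name",
    ['click("search box")', 'type("search box", "{place_name}")', 'click("search button")'],
)
# The more complex routine reuses the earlier one as a single step.
library.add(
    "get the zip code of a place",
    ["call:find a place by its name", 'read("address panel")'],
)
for action in library.expand("get the zip code of a place"):
    print(action)
```

Requiring callees to exist before a workflow is added mirrors the ordering in the text: the simpler routine has to be acquired first before anything can be built on it.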
The implication is that the right unit of agent memory is the sub-task routine with abstracted variables, not the full task trajectory and not generic helpful hints. The unit should be small enough to recur, abstracted enough to transfer, and structured enough to compose. This position contrasts directly with the related note "Does state-indexed memory outperform high-level workflow memory for web agents?", where PRAXIS argues the opposite: state-indexed local procedures outperform abstracted workflows precisely because abstraction loses the click-by-click specifics web environments demand.
MUSE-Autoskill operationalizes the same compounding principle but adds two pieces AWM leaves implicit: per-skill memory and cross-agent transfer. Where AWM induces workflow routines for one agent, MUSE attaches a dedicated memory to each skill that accumulates experience across tasks, so a routine is not merely reused but improves with reuse by adapting to runtime feedback. MUSE also shows that the resulting skills transfer to other agents with minimal accuracy loss, extending AWM's single-agent compounding into a shareable repository. AWM and MUSE are therefore complementary in the same way as the existing SkillClaw connection (cross-user propagation): AWM extracts workflows within a single agent, while MUSE provides experience-bearing skills that transfer across agents.
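The two additions attributed to MUSE, a memory attached to each skill and transfer between agents, can be sketched as follows. This is an illustrative assumption, not MUSE-Autoskill's implementation: the `Skill` fields, the `record` method, and the JSON export format are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Skill:
    """A routine plus its own experience memory (hypothetical layout)."""
    name: str
    steps: List[str]
    notes: List[str] = field(default_factory=list)  # lessons from runtime feedback
    runs: int = 0
    successes: int = 0

    def record(self, success: bool, note: str = "") -> None:
        """Accumulate experience from one use of the skill."""
        self.runs += 1
        self.successes += int(success)
        if note and note not in self.notes:
            self.notes.append(note)

    def success_rate(self) -> float:
        return self.successes / self.runs if self.runs else 0.0


def export_skill(skill: Skill) -> str:
    """Serialize a skill, including its memory, for another agent."""
    return json.dumps(asdict(skill))


def import_skill(payload: str) -> Skill:
    return Skill(**json.loads(payload))


skill = Skill("search for a product", ['type("search box", "{product_name}")'])
skill.record(success=False, note="dismiss the cookie banner before typing")
skill.record(success=True)

# A second agent receives the routine together with its accumulated experience.
received = import_skill(export_skill(skill))
print(received.success_rate(), received.notes)
```

Because the notes and counters travel with the skill, the receiving agent starts from the sender's experience instead of from a bare routine.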
Inquiring lines that use this note as a source 58
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why does AI-improved task performance fail to transfer to independent work?
- Can granular sub-task training for function calling improve both open and proprietary models?
- How should GUI agents remember patterns across different software environments?
- How do standardized artifacts improve coordination between multiple tools?
- How does the knowing-doing gap widen as tasks become more complex?
- Why do workflow abstractions fail in embodied agent environments?
- How does spatial density in web UIs break workflow-level memory?
- Can tool adaptation work without freezing the agent in the loop?
- Can domain-expert workflows always decompose into inspectable stages for AI?
- Can agentic reasoning outperform rigid rule-based systems for skill refinement?
- Why does GUI agent memory need different abstraction levels?
- Can subtask-level voting replace sequential revision for improving long-horizon task accuracy?
- Does outsourcing tasks to AI reduce opportunities for skill development?
- What task characteristics determine whether humans or agents should handle work?
- How do task characteristics determine whether to automate or defer or guide?
- Can agents develop shared abstractions through communication pressure alone?
- How should headers index procedural intent differently from keyword chunking?
- Why do static screenshot models fail to capture multi-step UI task intent?
- Can RL-trained meta-agents match or exceed manually designed workflows?
- Does internal task decomposition eliminate overhead from multi-agent coordination?
- How does PRAXIS differ architecturally from Agent Workflow Memory and causal rule learning?
- Why do a-priori procedural specifications fail as environments change and interfaces evolve?
- How do agents discover and select which tools to invoke?
- Why do APIs outperform UIs for agent task completion?
- How do agents discover and construct new APIs from existing applications?
- Can agentic AI tools deliver productivity gains on learning tasks differently?
- What separates good workflow design from poor workflow design?
- Can sub-task handlers be swapped between neural and symbolic systems?
- How do task stream groupings provide long-horizon learning signals for curation decisions?
- Can curator modules trained on one executor transfer to entirely different agent backbones?
- Should agents continuously prune irrelevant links during execution?
- How does procedural memory granularity affect web agent performance?
- How does workflow abstraction compare to state-indexed procedural memory for web agents?
- Can individual skills improve through reuse and accumulate experience across tasks?
- Do learned workflows transfer between different agents with minimal accuracy loss?
- What makes planning, tool use, and reasoning into jointly optimizable subsystems?
- Why does capability discovery become the bottleneck in large agent systems?
- Does workflow-level memory or state-action memory better capture reusable agent knowledge?
- How do agents automatically generate suitable learning tasks based on current capability?
- What makes composable abstractions emerge under performance pressure in agent systems?
- Can skill libraries prevent redundant narrow artifacts from proliferating?
- What lifecycle management prevents in-loop skill creation from bloating an agent?
- How do strategy-level abstractions differ from storing raw task workflows?
- What training method supports dynamic tool discovery in long-horizon agents?
- How do tool invocations drive agentic cost beyond token consumption?
- Can extracted skills transfer effectively across different domains and model architectures?
- Why does specializing to one task make future task learning harder?
- Can we predict which tasks will decompose into modular subnetworks?
- How do cache-dominant workflows change the marginal cost of agent tasks?
- How do external prompt artifacts improve agent behavior compared to inline instructions?
- How should abstraction preserve applicability conditions when distilling experience?
- Why does decomposition ability transfer across domains but solving ability does not?
- Which model capabilities actually matter for sustained workflow delegation?
- Why do agents systematically underuse condensed experience in skill documents?
- Which workflow positions concentrate the most downstream dependencies and influence?
- Why do estimates for task-level performance differ so much from full job automation timelines?
- Why does identifying UI element types and locations enable downstream task learning?
- What specific bookkeeping tasks can environments maintain more reliably than policies?
Related concepts in this collection 6
- Can agents learn better from their failures than successes?
Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank operates at a higher abstraction level than AWM: workflows are procedural ("how to do X"), strategies are conditional ("when X holds, approach Y"); ReasoningBank also incorporates failed experiences as preventative lessons, which AWM does not
- Does state-indexed memory outperform high-level workflow memory for web agents?
Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.
tension with: AWM claims abstracted workflows transfer best; PRAXIS claims state-indexed local procedures beat abstracted workflows because abstraction loses the click-by-click specifics. Both target web agents on similar benchmarks.
- Can frozen language models continually improve through memory structure alone?
If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
complements: CLIN abstracts causal-rule memory; AWM abstracts sub-task workflow memory; both argue the *shape* of textual memory matters more than the model. Three-way memory-granularity tension when paired with PRAXIS.
- Can agents learn new skills without forgetting old ones?
Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
extends: Voyager builds an ever-growing skill library by synthesis; AWM operationalizes the same compounding principle for web-agent sub-task routines.
- Can agents learn from failure without updating their weights?
Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
extends: Reflexion stores verbal reflections on trial outcomes; AWM stores abstracted sub-task workflows. The progression is generic-hint → causal-rule → workflow-routine.
- How can agent systems share learned skills across users?
Individual users operating autonomous agents independently rediscover solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?
complements: AWM is single-agent skill compounding; SkillClaw is cross-user skill propagation.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Agent Workflow Memory
- Why Do Multi-agent LLM Systems Fail?
- Agent S: An Open Agentic Framework that Uses Computers Like a Human
- Real-Time Procedural Learning From Experience for AI Agents
- MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- Towards a Science of Scaling Agent Systems
- LLMs Corrupt Your Documents When You Delegate
Original note title
agent workflow memory induces reusable sub-task routines and compounds them — yielding 24-51 percent relative success gains and snowballing skill complexity