SYNTHESIS NOTE

Can agents learn reusable sub-task routines from past experience?

Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This note explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.

Synthesis note · 2026-05-03 · sourced from Action Models

Agent Workflow Memory (AWM) takes the human heuristic of abstracting routines from past experience and operationalizes it for web agents. The diagnostic claim is that current agents fail at long-horizon tasks not because they lack reasoning but because they cannot extract and reuse sub-task workflows shared across similar tasks — they solve each task in isolation and never accumulate transferable skill structure.

AWM's intervention has two design choices that matter. First, granularity is below the task level: rather than memorizing "Buy dry cat food on Amazon and deliver to my address," the system induces "search for a product on Amazon" — a sub-task that re-appears across many top-level tasks. Second, example-specific contexts are abstracted out — "dry cat food" becomes "{product-name}" — so the workflow is reusable rather than overfit to its source trace.
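The two design choices above can be illustrated with a minimal sketch. It is not AWM's actual induction procedure, which the note does not describe in detail; the `Workflow` class, the helper functions, and the trace steps are hypothetical names and data chosen for illustration. It shows a concrete trace fragment being turned into a sub-task workflow with a `{product-name}` placeholder, then reused for a different product.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Workflow:
    """A sub-task routine whose example-specific values are placeholders."""
    name: str
    steps: Tuple[str, ...]


def abstract_steps(steps: List[str], bindings: Dict[str, str]) -> Tuple[str, ...]:
    """Replace each concrete value in a trace with a {variable} placeholder."""
    abstracted = []
    for step in steps:
        for variable, value in bindings.items():
            step = step.replace(value, "{" + variable + "}")
        abstracted.append(step)
    return tuple(abstracted)


def instantiate(workflow: Workflow, values: Dict[str, str]) -> List[str]:
    """Fill the placeholders back in for a new task."""
    filled = []
    for step in workflow.steps:
        for variable, value in values.items():
            step = step.replace("{" + variable + "}", value)
        filled.append(step)
    return filled


# Hypothetical trace fragment taken from a "buy dry cat food" task.
trace = [
    "type 'dry cat food' into the search box",
    "click the search button",
]
search = Workflow(
    name="search for a product on Amazon",
    steps=abstract_steps(trace, {"product-name": "dry cat food"}),
)
new_steps = instantiate(search, {"product-name": "wireless mouse"})
```

Plain string replacement is used here only to keep the sketch short; a real system would presumably rely on a model to decide which values are example-specific.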

The compounding effect is the key behavior. Once "find a place by its name" exists, it serves as a building block for "get the zip code of a place." Skill memory therefore grows hierarchically: complex workflows are constructed on top of previously acquired ones. Empirically this produces a 24.6% relative gain in success rate on Mind2Web and 51.1% on WebArena, with a 22.5-point gap over the baseline on WebArena after only tens of examples. Critically, online AWM's advantage widens as the train-test gap grows, from 8.9 to 14.0 absolute points, because workflow abstractions transfer where memorized trajectories do not.
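A small sketch can make the hierarchical growth concrete. It assumes a hypothetical representation in which a workflow step prefixed with `use:` refers to an earlier workflow by name; the prefix, the `LIBRARY` dictionary, and the step wording are illustrative assumptions, not AWM's format.

```python
from typing import Dict, List

# Hypothetical workflow library. A step starting with "use:" refers to a
# previously induced workflow by name instead of spelling out its actions.
LIBRARY: Dict[str, List[str]] = {
    "find a place by its name": [
        "type {place-name} into the map search box",
        "click the first search result",
    ],
    "get the zip code of a place": [
        "use:find a place by its name",
        "read the zip code from the address panel",
    ],
}


def expand(name: str, library: Dict[str, List[str]]) -> List[str]:
    """Flatten a workflow into primitive steps by expanding nested references."""
    steps: List[str] = []
    for step in library[name]:
        if step.startswith("use:"):
            # Recurse into the referenced workflow (no cycle detection here).
            steps.extend(expand(step[len("use:"):], library))
        else:
            steps.append(step)
    return steps


zip_steps = expand("get the zip code of a place", LIBRARY)
```

The later workflow stores only one new step of its own; everything else is inherited from the earlier routine, which is the sense in which skills compound.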

The implication is that the right unit of agent memory is the sub-task routine with abstracted variables, not the full task trajectory and not generic helpful hints. The unit should be small enough to recur, abstracted enough to transfer, and structured enough to compose. This position contrasts directly with the note "Does state-indexed memory outperform high-level workflow memory for web agents?", where PRAXIS argues the opposite: that state-indexed local procedures outperform abstracted workflows precisely because abstraction loses the click-by-click specifics web environments demand.

MUSE-Autoskill operationalizes the same compounding principle but adds the two pieces AWM leaves implicit: per-skill memory and cross-agent transfer. Where AWM induces workflow routines for one agent, MUSE attaches a dedicated memory to each skill that accumulates experience across tasks, so a routine does not merely get reused — it gets better with reuse, adapting from runtime feedback. And MUSE shows the resulting skills transfer to other agents with minimal accuracy loss, extending AWM's single-agent compounding into a shareable repository. This makes AWM and MUSE complementary on the same axis as the existing SkillClaw connection (cross-user propagation): AWM = workflow extraction within an agent; MUSE = experience-bearing skills transferable across agents.
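The difference between a reusable routine and an experience-bearing skill can be sketched as follows. This is an illustrative data structure only, not MUSE-Autoskill's implementation; `Skill`, `record`, `prompt`, and `share` are hypothetical names, and the feedback notes are invented examples.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class Skill:
    """A routine plus a dedicated memory of lessons from earlier runs."""
    name: str
    steps: List[str]
    memory: List[str] = field(default_factory=list)

    def record(self, note: str) -> None:
        """Store a lesson from runtime feedback, skipping exact duplicates."""
        if note not in self.memory:
            self.memory.append(note)

    def prompt(self) -> str:
        """Render the routine together with its accumulated lessons."""
        lines = [f"Skill: {self.name}"] + list(self.steps)
        if self.memory:
            lines.append("Lessons from earlier runs:")
            lines.extend(f"- {note}" for note in self.memory)
        return "\n".join(lines)


def share(skill: Skill) -> Skill:
    """Hand a skill to another agent as an independent copy, memory included."""
    return copy.deepcopy(skill)


skill = Skill(
    name="search for a product",
    steps=["type {product-name} into the search box", "click the search button"],
)
skill.record("wait for the autocomplete list to close before clicking")
skill.record("wait for the autocomplete list to close before clicking")  # ignored

shared = share(skill)  # a second agent receives the skill with its lessons
shared.record("retry once if the results page is empty")
```

Copying the memory along with the steps is one simple way to model transfer across agents; whether the receiving agent's later lessons should flow back to a shared repository is a separate design question the sketch leaves open.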

Inquiring lines that read this note 68

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does AI adoption affect human skill development and labor equality?

What determines success in training models on multiple tasks?

What memory abstraction level best enables agent knowledge reuse?

How do standardized protocols improve coordination in multi-agent systems?

How do neural networks separate factual knowledge from reasoning abilities?

How does the knowing-doing gap widen as tasks become more complex?

How can AI agents autonomously learn and transfer skills across tasks?

What causes silent corruption to amplify through delegated workflows?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Can subtask-level voting replace sequential revision for improving long-horizon task accuracy?

When should tasks involve human-AI partnership versus full automation?

How do multi-agent systems achieve genuine cooperation and reasoning?

Can agents develop shared abstractions through communication pressure alone?

How do prompt structure and constraints affect model instruction reliability?

Should GUI agents use structured representations instead of raw pixels?

When do multi-agent approaches outperform single model extended thinking?

Does externalizing cognitive work and state improve agent reliability?

What drives capability and cost efficiency in agent systems?

How should agents balance memory condensation to optimize context efficiency?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How should systems govern persistent agent-generated code in shared infrastructure?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Why does specializing to one task make future task learning harder?

Can single-axis benchmarks accurately predict agent deployment success?

Why do estimates for task-level performance differ so much from full job automation timelines?

Related concepts in this collection 6

Can agents learn better from their failures than successes? Does storing reasoning strategies extracted from both successful and failed experiences improve agent learning compared to tracking only successes or raw trajectories? This matters because failures offer preventative lessons that successes alone cannot teach.
ReasoningBank operates at a higher abstraction level than AWM: workflows are procedural ("how to do X"), strategies are conditional ("when X holds, approach Y"); ReasoningBank also incorporates failed experiences as preventative lessons, which AWM does not
Does state-indexed memory outperform high-level workflow memory for web agents? Should procedural memory for web agents be organized around specific environment states and actions, or abstracted into higher-level workflows? This matters because web automation demands precise, context-sensitive recall that workflows might lose.
tension with: AWM claims abstracted workflows transfer best; PRAXIS claims state-indexed local procedures beat abstracted workflows because abstraction loses the click-by-click specifics. Both target web agents on similar benchmarks.
Can frozen language models continually improve through memory structure alone? If agents can't update parameters, what form of textual memory lets them keep learning across trials and transfer to new tasks without retraining?
complements: CLIN abstracts causal-rule memory; AWM abstracts sub-task workflow memory; both argue the *shape* of textual memory matters more than the model. Three-way memory-granularity tension when paired with PRAXIS.
Can agents learn new skills without forgetting old ones? Explores whether externalized skill libraries—storing learned behaviors as retrievable code rather than parameter updates—can solve the catastrophic forgetting problem that plagues continual learning systems.
extends: Voyager builds an ever-growing skill library by synthesis; AWM operationalizes the same compounding principle for web-agent sub-task routines.
Can agents learn from failure without updating their weights? Explores whether language models can improve through trial and error by storing reflections in episodic memory rather than fine-tuning. This matters because it suggests a fundamentally different path to agent adaptation.
extends: Reflexion stores verbal reflections on trial outcomes; AWM stores abstracted sub-task workflows. The progression is generic-hint → causal-rule → workflow-routine.
How can agent systems share learned skills across users? Individual users operating autonomous agents independently rediscover solutions because systems lack mechanisms to propagate discoveries. Can centralized aggregation and automatic evolution convert isolated experiences into shared capabilities?
complements: AWM is single-agent skill compounding; SkillClaw is cross-user skill propagation.
