INQUIRING LINE

Agentic Systems and Tool Use · Model Architecture and Internals · Training, RL, and Test-Time Scaling · cross-cluster

Can workflow memory compound reusable skills into measurable success improvements?

This explores whether agents that store and reuse the routines they discover — 'workflow memory' — actually post measurable performance gains, and what makes that compounding work.

The corpus says yes, and unusually for this field, with hard numbers. Agent Workflow Memory extracts reusable sub-task routines (not whole-task scripts), strips out the example-specific values, and stacks them hierarchically, yielding a 24.6% relative gain on Mind2Web and 51.1% on WebArena, with the gains *widening* as the gap between training and test tasks grows (see "Can agents learn reusable sub-task routines from past experience?"). That last detail is the interesting part: the more novel the situation, the more a library of abstracted routines pays off, because reusable skills generalize where memorized full solutions don't.
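To make "strips out the example-specific values" concrete, here is a minimal sketch of the abstraction step. It is not Agent Workflow Memory's actual implementation: the `Workflow` class, the `abstract_steps` helper, and the action strings are all hypothetical names chosen for illustration, and hierarchical stacking is left out.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """A reusable sub-task routine whose example-specific values are placeholders."""
    name: str
    steps: list[str]  # step templates such as "type('search box', '{city}')"

    def instantiate(self, **values: str) -> list[str]:
        """Fill the placeholders with the values of a new task."""
        return [step.format(**values) for step in self.steps]


def abstract_steps(steps: list[str], bindings: dict[str, str]) -> list[str]:
    """Replace each concrete value (e.g. 'Seattle') with a named placeholder ('{city}')."""
    abstracted = []
    for step in steps:
        for param, concrete in bindings.items():
            step = step.replace(concrete, "{" + param + "}")
        abstracted.append(step)
    return abstracted


# Hypothetical trajectory observed while solving one concrete task.
trajectory = [
    "click('search box')",
    "type('search box', 'Seattle')",
    "click('Search')",
]

# Store the routine with the example-specific value abstracted away.
search_by_city = Workflow(
    name="search_by_city",
    steps=abstract_steps(trajectory, {"city": "Seattle"}),
)

# Reuse the same routine on a task the agent has not seen before.
boston_steps = search_by_city.instantiate(city="Boston")
```

The stored routine no longer mentions Seattle, which is what lets it transfer to tasks that differ from the one it was learned on.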

The deeper claim underneath the numbers is that reliability comes from *externalizing* cognition rather than scaling the model. One synthesis frames reliable agents as offloading three burdens (memory, skills, and protocols) into a harness layer so the model stops re-solving the same problems (see "Where does agent reliability actually come from?"). VOYAGER is the canonical demonstration: an embedding-indexed library of executable skills, with complex skills composed from simpler ones, lets an agent learn continuously and, crucially, avoid the catastrophic forgetting that weight-update methods suffer (see "Can agents learn new skills without forgetting old ones?"). So 'compounding' isn't a metaphor; it's literal composition of stored procedures into bigger ones.
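The idea of an embedding-indexed library with composed skills can be sketched in a few lines. This is an illustration under stated assumptions, not VOYAGER's code: `embed` is a toy bag-of-words stand-in for a real embedding model, and the `SkillLibrary` class and the skill names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable

Skill = Callable[[], list[str]]  # a skill returns the primitive actions it performs


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SkillLibrary:
    """Executable skills indexed by the embedding of their description."""

    def __init__(self) -> None:
        self._skills: dict[str, tuple[Counter, Skill]] = {}

    def add(self, name: str, description: str, fn: Skill) -> None:
        self._skills[name] = (embed(description), fn)

    def get(self, name: str) -> Skill:
        return self._skills[name][1]

    def retrieve(self, query: str) -> str:
        """Return the name of the skill whose description best matches the query."""
        q = embed(query)
        return max(self._skills, key=lambda n: cosine(q, self._skills[n][0]))


lib = SkillLibrary()
lib.add("mine_wood", "collect wood logs from a tree",
        lambda: ["find tree", "chop tree"])
lib.add("craft_planks", "craft wooden planks from logs",
        lambda: ["open inventory", "convert logs to planks"])


def craft_table() -> list[str]:
    # A complex skill composed from two simpler stored skills plus one new step.
    return lib.get("mine_wood")() + lib.get("craft_planks")() + ["craft table from planks"]


lib.add("craft_table", "build a crafting table", craft_table)

best_match = lib.retrieve("how do I build a crafting table")
plan = lib.get(best_match)()
```

Adding `craft_table` changes nothing about the two skills it calls, which is the sense in which a library avoids forgetting: new skills are appended rather than written over old weights.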

But a skill library only compounds if it's *curated*, not just accumulated. SkillOS shows that separating a trainable curator from a frozen executor shifts a repository away from generic, verbose entries toward actionable execution logic and cross-task meta-strategies, and the trained curator transfers across different model backbones (see "Can a separate trained curator improve skill libraries better than frozen agents?"). SkillRL adds a sharp twist on *what* to store: treat successes as concrete demonstrations and failures as abstracted lessons. That asymmetry hits state-of-the-art while using far less context than dumping everything in uniformly (see "Should successful and failed episodes be processed differently?"). The implication is that naive 'remember everything' memory degrades, while differentiated memory improves both efficiency and the policy.
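The success/failure asymmetry can be illustrated with a small sketch. It is not SkillRL's method: the `Episode` type and `to_memory_entry` function are hypothetical, and the template string that produces a lesson stands in for whatever model-written abstraction a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    task: str
    steps: list[str]
    success: bool
    error: str = ""  # short failure description, empty on success


def to_memory_entry(ep: Episode) -> dict:
    """Store successes verbatim as demonstrations and failures as one-line lessons."""
    if ep.success:
        # Concrete demonstration: keep every step so it can be replayed or imitated.
        return {"kind": "demonstration", "task": ep.task, "content": list(ep.steps)}
    # Abstracted lesson: drop the trajectory and keep only what to avoid and why.
    last_action = ep.steps[-1] if ep.steps else "(no action taken)"
    lesson = f"On '{ep.task}', avoid '{last_action}': {ep.error}"
    return {"kind": "lesson", "task": ep.task, "content": [lesson]}


# Hypothetical episodes for illustration only.
won = Episode("book a flight", ["open site", "search SEA to BOS", "pay"], success=True)
lost = Episode(
    "book a flight",
    ["open site", "search SEA to BOS", "click banner ad"],
    success=False,
    error="left the booking flow",
)

memory = [to_memory_entry(won), to_memory_entry(lost)]
```

The failed episode contributes one line to memory instead of its full trajectory, which is where the context saving in this sketch comes from.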

There's a counterweight worth knowing about, because it's where the compounding story breaks. In long, multi-turn workflows, agents fail not from missing knowledge but from *weak memory control*: transcript replay and retrieval lack gating, so errors and constraint drift accumulate, and a bounded, schema-governed committed state fixes it (see "Can agents fail from weak memory control rather than missing knowledge?"). The stakes are concrete: frontier models silently corrupt ~25% of document content across extended relay tasks, with errors compounding without plateauing through 50 round-trips (see "Do frontier LLMs silently corrupt documents in long workflows?"). So memory compounds in both directions: good routines compound into measurable wins, but ungated memory compounds errors just as reliably.
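A bounded, schema-governed committed state can be sketched as a small gate in front of permanent memory. The `CommittedState` class, its field names, and the per-field bound below are assumptions made for illustration, not the design from the cited note.

```python
class CommittedState:
    """Permanent memory that accepts only schema-valid writes, up to a fixed bound."""

    def __init__(self, schema: dict[str, type], max_items_per_field: int = 5) -> None:
        self._schema = schema
        self._max = max_items_per_field
        self._state: dict[str, list] = {name: [] for name in schema}

    def commit(self, field: str, value: object) -> bool:
        """Return True if the value is (or already was) committed, False if rejected."""
        if field not in self._schema:
            return False  # unknown field: transcript noise never reaches memory
        if not isinstance(value, self._schema[field]):
            return False  # wrong type for this field
        if value in self._state[field]:
            return True   # idempotent: repeating a write does not grow the state
        if len(self._state[field]) >= self._max:
            return False  # bound reached: the state cannot grow without limit
        self._state[field].append(value)
        return True

    def snapshot(self) -> dict[str, tuple]:
        """Read-only view, so recalling state can never modify it."""
        return {name: tuple(items) for name, items in self._state.items()}


state = CommittedState({"constraints": str, "decisions": str}, max_items_per_field=2)

results = [
    state.commit("constraints", "budget <= 500"),  # accepted
    state.commit("scratch", "maybe try a hack"),   # rejected: not in the schema
    state.commit("decisions", 42),                 # rejected: wrong type
    state.commit("decisions", "use vendor A"),     # accepted
    state.commit("decisions", "ship Friday"),      # accepted
    state.commit("decisions", "add feature X"),    # rejected: bound reached
]
```

Every write passes through `commit`, so anything the agent merely recalled or drafted stays out of permanent memory unless it clears the schema and the bound.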

The thing you might not have known you wanted: the same mechanism, accumulation across a workflow, drives both the 51.1% gain on WebArena and the ~25% document corruption. The difference between them is entirely *governance*: abstraction (drop example-specific values), composition (build complex from simple), curation (a trained editor, not a junk drawer), and gating (a bounded committed state). Memory doesn't help because it's memory; it helps when something decides what's worth keeping.

Sources (7 notes)

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents fail from weak memory control rather than missing knowledge?

Agent performance degrades in long workflows because transcript replay and retrieval-based memory lack gating mechanisms. A bounded, schema-governed committed state that separates artifact recall from permanent memory write prevents error accumulation and constraint drift.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
