How does procedural memory granularity affect web agent performance?
This explores whether the *level of abstraction* at which a web agent stores its learned procedures (fine-grained click-by-click state, mid-level sub-task routines, or high-level workflows) changes how well it actually performs. The corpus suggests there is no universally best granularity: the right grain depends on where the task's difficulty lives.
The sharpest result comes from PRAXIS, which finds that indexing procedures by *environment state and local action pairs* (essentially remembering "in this UI situation, this click") beats higher-level workflow abstractions for web agents (see "Does state-indexed memory outperform high-level workflow memory for web agents?"). The reason is intuitive once named: web tasks fail on fine-grained UI specifics, and workflow-level summaries throw away exactly the click-by-click detail that determines success.
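To make "in this UI situation, this click" concrete, here is a minimal sketch of a state-indexed memory. It is an illustration only, not PRAXIS's implementation: the `UIState` fields, the `StateActionMemory` class, the action strings, and the Jaccard-overlap matching rule are all assumptions chosen for brevity.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UIState:
    """Hypothetical, simplified description of a UI situation."""
    url_path: str
    visible_elements: frozenset


@dataclass
class StateActionMemory:
    """Stores (state, action) pairs and retrieves by state similarity."""
    entries: list = field(default_factory=list)

    def store(self, state, action):
        self.entries.append((state, action))

    def retrieve(self, state):
        """Return the action of the best-matching stored state, or None.

        Assumed matching rule: same URL path, then highest Jaccard
        overlap between the sets of visible elements.
        """
        best_action, best_score = None, 0.0
        for stored, action in self.entries:
            if stored.url_path != state.url_path:
                continue
            union = stored.visible_elements | state.visible_elements
            overlap = stored.visible_elements & state.visible_elements
            score = len(overlap) / len(union) if union else 0.0
            if score > best_score:
                best_action, best_score = action, score
        return best_action


memory = StateActionMemory()
memory.store(UIState("/checkout", frozenset({"promo_field", "pay_button"})),
             "click:pay_button")
memory.store(UIState("/checkout", frozenset({"cookie_banner", "accept_button"})),
             "click:accept_button")

# A cookie banner is covering the page, so the closest remembered
# situation is the one where the banner was dismissed first.
current = UIState("/checkout",
                  frozenset({"cookie_banner", "accept_button", "pay_button"}))
suggested = memory.retrieve(current)
```

A workflow-level summary such as "go to checkout, then pay" would have no slot for the banner; the state-indexed entry keeps that detail because the UI situation itself is the lookup key.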
But that finding flips in other domains, which is the part worth knowing. One note lays out granularity as a *domain-conditional* choice among three options: workflow-level memory wins in routine-rich tasks where the only variance is changing arguments, causal-rule memory wins where the environment's logic is what's hard, and state-action memory wins precisely in spatially-rich web tasks (see "Does agent memory work better at one level of abstraction?"). So PRAXIS isn't contradicting the value of workflows; it's confirming that web UI is the corner of the map where the finest grain pays off.
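The three-way mapping above can be written down as a small lookup. This is only a restatement of the note's claim in code; the enum, the granularity labels, and `choose_granularity` are hypothetical names, and deciding which source of variance dominates a real task is the hard part this sketch leaves out.

```python
from enum import Enum


class VarianceSource(Enum):
    """Where a task's difficulty comes from (hypothetical labels)."""
    ARGUMENTS = "arguments"                # routine-rich: only the inputs change
    CAUSAL_STRUCTURE = "causal_structure"  # the environment's logic is the hard part
    UI_STATE = "ui_state"                  # spatially-rich: fine-grained UI specifics


# Mapping as described in the text: each source of variance favours
# a different memory granularity.
GRANULARITY_BY_VARIANCE = {
    VarianceSource.ARGUMENTS: "workflow",
    VarianceSource.CAUSAL_STRUCTURE: "causal_rule",
    VarianceSource.UI_STATE: "state_action",
}


def choose_granularity(source: VarianceSource) -> str:
    """Return the memory granularity the note associates with this variance source."""
    return GRANULARITY_BY_VARIANCE[source]


web_choice = choose_granularity(VarianceSource.UI_STATE)
```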
There's a productive tension with Agent Workflow Memory, which reports large relative gains on web benchmarks (24.6% on Mind2Web, 51.1% on WebArena) by inducing *sub-task* routines, finer than whole tasks but coarser than individual clicks, and compounding them hierarchically (see "Can agents learn reusable sub-task routines from past experience?"). Read together, these suggest granularity isn't a single dial but a hierarchy: reusable mid-level routines for structure, plus state-action specifics for the moments where the UI bites.
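One way to picture routines that abstract example-specific values and compound hierarchically is a small composite structure. This is a sketch under assumptions, not Agent Workflow Memory's actual induction method: the `Routine` class, the action-string format, and the placeholder names are invented for illustration, and the routines here are written by hand rather than induced from experience.

```python
from dataclasses import dataclass


@dataclass
class Routine:
    """A named routine whose steps are action templates or other routines."""
    name: str
    steps: list

    def expand(self, **values):
        """Flatten the hierarchy into concrete actions, filling in placeholders."""
        actions = []
        for step in self.steps:
            if isinstance(step, Routine):
                actions.extend(step.expand(**values))
            else:
                # Placeholders such as {query} stand in for example-specific values.
                actions.append(step.format(**values))
        return actions


# Sub-task routines: finer than a whole task, coarser than a single click.
search = Routine("search", ["type:search_box:{query}", "click:search_button"])
add_to_cart = Routine("add_to_cart", ["click:result:{item}", "click:add_to_cart"])

# A larger routine built from the smaller ones.
buy_item = Routine("buy_item", [search, add_to_cart, "click:checkout"])

plan = buy_item.expand(query="usb cable", item="usb cable 2m")
```

The same `search` routine can be reused inside any other routine with different arguments, which is the sense in which mid-level routines supply structure while leaving room for state-specific detail elsewhere.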
The corpus also reframes the question from "how detailed?" to "how well curated?" Continuously compressing memory into higher-level abstractions follows an inverted-U: it helps at first, then degrades. In the reported case, GPT-5.4 failed 54% of previously-solved problems after over-consolidation, through three mechanisms: misgrouping, stripping away the conditions under which a procedure applies, and overfitting (see "Does agent memory degrade when continuously consolidated?"). That "applicability stripping" is the same failure as losing click-by-click context, just arrived at by a different route. A companion note drives it home: the real bottleneck is quality, not storage, and adding capacity without curation actively makes agents worse (see "Is agent memory capacity or quality the real bottleneck?").
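Applicability stripping is easy to show in miniature. The sketch below is a toy model, not the consolidation procedure studied in the note: `Procedure`, `applies_when`, and `strip_conditions` are hypothetical names, and conditions are modelled as plain string labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    """A remembered action plus the conditions under which it applies."""
    action: str
    applies_when: frozenset  # empty set means "always applicable"

    def applicable(self, context: set) -> bool:
        # Applicable only if every required condition holds in the context.
        return self.applies_when <= context


def strip_conditions(proc: Procedure) -> Procedure:
    """Toy over-consolidation: keep the action, drop its preconditions."""
    return Procedure(proc.action, frozenset())


original = Procedure("click:guest_checkout", frozenset({"logged_out"}))
stripped = strip_conditions(original)

context = {"logged_in"}
original_fires = original.applicable(context)  # condition not met
stripped_fires = stripped.applicable(context)  # fires where it should not
```

The stripped procedure looks more general, but it now matches contexts where the action is wrong, which is how a consolidated memory can fail problems the original memory solved.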
The forward-looking thread is that maybe granularity shouldn't be fixed at all. FluxMem lets the memory's topology adapt through execution feedback, with links forming, refining, and consolidating based on what actually worked, and reaches state-of-the-art by aligning abstraction dynamically rather than committing to one grain up front (see "Should agent memory adapt dynamically based on execution feedback?"). The takeaway for a curious reader: for web agents specifically, finer beats coarser, but the deeper lesson is that the best systems let the task tell them how fine to go.
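The idea of links that form, strengthen, and disappear based on execution feedback can be sketched with a simple weighted graph. This is not FluxMem's algorithm: the `AdaptiveLinks` class, the moving-average update, and the `learning_rate` and `prune_below` parameters are assumptions made to keep the example small.

```python
from collections import defaultdict


class AdaptiveLinks:
    """Toy memory graph whose link weights follow execution feedback."""

    def __init__(self, learning_rate=0.5, prune_below=0.1):
        self.weights = defaultdict(float)  # (source, target) -> weight
        self.learning_rate = learning_rate
        self.prune_below = prune_below

    def feedback(self, source, target, success):
        """Move the link weight toward 1 on success and toward 0 on failure."""
        key = (source, target)
        reward = 1.0 if success else 0.0
        weight = self.weights[key]
        weight += self.learning_rate * (reward - weight)
        if weight < self.prune_below:
            self.weights.pop(key, None)  # weak links are removed
        else:
            self.weights[key] = weight

    def neighbors(self, source):
        """Targets linked from source, strongest link first."""
        linked = [(t, w) for (s, t), w in self.weights.items() if s == source]
        return [t for t, _ in sorted(linked, key=lambda pair: pair[1], reverse=True)]


links = AdaptiveLinks()
for outcome in (True, True):
    links.feedback("login_page", "fill_credentials", outcome)
for outcome in (True, False, False, False):
    links.feedback("login_page", "click_banner_ad", outcome)

remaining = links.neighbors("login_page")
```

After repeated failures the second link falls below the pruning threshold and drops out, so retrieval from `login_page` only follows the link that kept working.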
Sources (6 notes)
PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.
Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.
The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.
FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.