INQUIRING LINE

How does procedural memory granularity affect web agent performance?

This explores whether the *level of abstraction* at which a web agent stores its learned procedures (fine-grained click-by-click state, mid-level sub-task routines, or high-level workflows) changes how well it actually performs. The corpus suggests that granularity isn't one-size-fits-all: the right choice depends on where the task's difficulty lives.


The collection's clearest message is that there is no universally best granularity for a web agent's procedural memory: the right grain depends on what makes the task hard. The sharpest result comes from PRAXIS, which finds that indexing procedures by *environment state and local action pairs* (essentially remembering "in this UI situation, this click") beats higher-level workflow abstractions for web agents (see "Does state-indexed memory outperform high-level workflow memory for web agents?"). The reason is intuitive once named: web tasks fail on fine-grained UI specifics, and workflow-level summaries throw away exactly the click-by-click detail that determines success.
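To make the "in this UI situation, this click" idea concrete, here is a minimal sketch of a state-indexed memory. It is not PRAXIS's implementation: the `UIState` signature (a URL pattern plus a set of visible labels), the Jaccard-overlap matching, and the action strings are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UIState:
    """Hypothetical compact signature of a page state (an assumption, not from the source)."""
    url_pattern: str
    visible_labels: frozenset


@dataclass
class StateActionMemory:
    """Stores (state, action) pairs and recalls the action for the closest stored state."""
    entries: list = field(default_factory=list)

    def remember(self, state, action):
        self.entries.append((state, action))

    def recall(self, state):
        """Return the action whose stored state overlaps most with the current one."""
        best_action, best_score = None, 0.0
        for stored, action in self.entries:
            if stored.url_pattern != state.url_pattern:
                continue  # only compare states from the same kind of page
            union = stored.visible_labels | state.visible_labels
            if not union:
                continue
            # Jaccard overlap of visible labels (illustrative similarity measure)
            score = len(stored.visible_labels & state.visible_labels) / len(union)
            if score > best_score:
                best_action, best_score = action, score
        return best_action


memory = StateActionMemory()
memory.remember(UIState("/checkout", frozenset({"Cart", "Promo code", "Pay"})), "click('Promo code')")
memory.remember(UIState("/checkout", frozenset({"Shipping address", "Continue"})), "click('Continue')")

current = UIState("/checkout", frozenset({"Cart", "Promo code", "Pay", "Help"}))
print(memory.recall(current))  # click('Promo code')
```

The point of the sketch is the index: recall is keyed on what the page looks like right now, so the click-by-click detail survives instead of being summarized away.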

But that finding flips in other domains, which is the part worth knowing. One note lays out granularity as a *domain-conditional* choice with three regimes: workflow-level memory wins in routine-rich tasks where the only variance is changing arguments, causal-rule memory wins where the environment's logic is what's hard, and state-action memory wins precisely in spatially-rich web tasks (see "Does agent memory work better at one level of abstraction?"). So PRAXIS isn't contradicting the value of workflows; it's confirming that web UI is the corner of the map where the finest grain pays off.

There's a productive tension with Agent Workflow Memory, which gets large relative gains on web benchmarks (24.6% on Mind2Web, 51.1% on WebArena) by inducing *sub-task* routines, finer than whole tasks but coarser than individual clicks, and compounding them hierarchically (see "Can agents learn reusable sub-task routines from past experience?"). Read together, these suggest granularity isn't a single dial but a hierarchy: reusable mid-level routines for structure, plus state-action specifics for the moments where the UI bites.
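A rough sketch of what "sub-task routines with abstracted values, compounded hierarchically" can look like in code. This is an illustration under stated assumptions, not Agent Workflow Memory's actual format: the `ROUTINES` table, the `@name` convention for referencing another routine, and the `$placeholder` arguments are all hypothetical.

```python
from string import Template

# Hypothetical routine library. Steps starting with "@" refer to another routine;
# "$name" placeholders stand in for example-specific values that were abstracted away.
ROUTINES = {
    "search_item": ["type('search box', '$query')", "click('Search')"],
    "add_to_cart": ["click('$item')", "click('Add to cart')"],
    "buy_item": ["@search_item", "@add_to_cart", "click('Checkout')"],
}


def expand(name, args):
    """Recursively expand a routine into concrete steps, filling in placeholders."""
    steps = []
    for step in ROUTINES[name]:
        if step.startswith("@"):
            steps.extend(expand(step[1:], args))  # compose a lower-level routine
        else:
            steps.append(Template(step).substitute(args))
    return steps


for step in expand("buy_item", {"query": "usb cable", "item": "USB-C cable 1m"}):
    print(step)
```

Here `buy_item` is built from two smaller routines, so a new task that only changes the arguments reuses the whole structure, while anything the placeholders cannot express still needs finer-grained memory.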

The corpus also reframes the question from "how detailed?" to "how well curated?" Continuously compressing memory into higher-level abstractions follows an inverted-U: it helps at first, then degrades, with a frontier model failing 54% of previously-solved problems after over-consolidation. The mechanisms are misgrouping, stripping away the conditions under which a procedure applies, and overfitting (see "Does agent memory degrade when continuously consolidated?"). That "applicability stripping" is the same failure as losing click-by-click context, just arrived at by a different route. A companion note drives it home: the real bottleneck is quality, not storage, and adding capacity without curation actively makes agents worse (see "Is agent memory capacity or quality the real bottleneck?").
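One way to picture "applicability stripping" is a consolidation step that merges similar procedures and drops the conditions attached to them. The sketch below shows the opposite choice: merge only procedures with identical steps, and keep every original condition set as an alternative. The `Procedure` structure and the condition strings are assumptions made for illustration; the source notes do not describe a specific data format.

```python
from dataclasses import dataclass


@dataclass
class Procedure:
    steps: tuple
    applies_when: list  # list of frozensets; each one is a sufficient set of conditions


def consolidate(procedures):
    """Merge procedures with identical steps, preserving every applicability condition."""
    merged = {}
    for proc in procedures:
        if proc.steps in merged:
            merged[proc.steps].applies_when.extend(proc.applies_when)
        else:
            # copy the condition list so the input procedures are not mutated
            merged[proc.steps] = Procedure(proc.steps, list(proc.applies_when))
    return list(merged.values())


def applicable(proc, context):
    """A procedure applies if any one of its stored condition sets holds in the context."""
    return any(cond <= context for cond in proc.applies_when)


procs = [
    Procedure(("click('Export')", "click('CSV')"), [frozenset({"role:admin"})]),
    Procedure(("click('Export')", "click('CSV')"), [frozenset({"page:reports", "has_data"})]),
    Procedure(("click('Delete')",), [frozenset({"role:admin", "item_selected"})]),
]
merged = consolidate(procs)
export = merged[0]

print(len(merged))                                                    # 2
print(applicable(export, frozenset({"role:admin", "page:home"})))     # True
print(applicable(export, frozenset({"page:reports"})))                # False
```

A consolidator that kept only the steps would return the export procedure for the last context too, which is the kind of silent over-generalization the note describes.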

The forward-looking thread is that maybe granularity shouldn't be fixed at all. FluxMem lets the memory's topology adapt through execution feedback, with links forming, refining, and consolidating based on what actually worked, and reaches state-of-the-art by aligning abstraction dynamically rather than committing to one grain up front (see "Should agent memory adapt dynamically based on execution feedback?"). The takeaway for a curious reader: for web agents specifically, finer beats coarser, but the deeper lesson is that the best systems let the task tell them how fine to go.
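As a toy illustration of links that form, strengthen, and disappear based on execution feedback, the sketch below keeps a weight per pair of memory items and nudges it toward 1 on success and toward 0 on failure. This is not FluxMem's algorithm: the `AdaptiveLinks` class, the moving-average update, and the `learning_rate` and `prune_below` parameters are assumptions for illustration only.

```python
class AdaptiveLinks:
    """Undirected, weighted links between memory items, updated from outcomes."""

    def __init__(self, learning_rate=0.5, prune_below=0.1):
        self.learning_rate = learning_rate
        self.prune_below = prune_below
        self.weights = {}  # frozenset({a, b}) -> weight in [0, 1]

    def feedback(self, a, b, success):
        """Move the a-b link weight toward 1.0 on success and 0.0 on failure."""
        if a == b:
            return  # self-links are ignored in this sketch
        key = frozenset((a, b))
        old = self.weights.get(key, 0.0)
        target = 1.0 if success else 0.0
        new = old + self.learning_rate * (target - old)
        if new < self.prune_below:
            self.weights.pop(key, None)  # weak links are removed entirely
        else:
            self.weights[key] = new

    def neighbors(self, item):
        """Items linked to `item`, strongest link first."""
        linked = [(next(iter(key - {item})), w)
                  for key, w in self.weights.items() if item in key]
        return [name for name, _ in sorted(linked, key=lambda pair: -pair[1])]


links = AdaptiveLinks()
links.feedback("login", "open_cart", True)   # weight 0.5
links.feedback("login", "open_cart", True)   # weight 0.75
links.feedback("login", "search", True)      # weight 0.5
print(links.neighbors("login"))              # ['open_cart', 'search']

for _ in range(3):
    links.feedback("login", "search", False)  # 0.25, 0.125, then pruned at 0.0625
print(links.neighbors("login"))              # ['open_cart']
```

The design choice being illustrated is that connectivity is a result of what worked, so the effective grain of retrieval shifts with experience instead of being fixed in advance.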


Sources (6 notes)

Does state-indexed memory outperform high-level workflow memory for web agents?

PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.

Does agent memory work better at one level of abstraction?

Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

Is agent memory capacity or quality the real bottleneck?

The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.

Should agent memory adapt dynamically based on execution feedback?

FluxMem demonstrates that adaptive memory topology—where links form, refine, and consolidate based on closed-loop execution feedback—consistently reaches state-of-the-art across three distinct benchmarks. Dynamic connectivity outperforms fixed retrieval by aligning abstraction and eliminating interference.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing procedural memory granularity claims in web agents. The question remains: does memory grain size matter for web task performance, and if so, how?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, documenting a shift from fixed granularity to adaptive regimes:
• State-action memory (UI-situation + click pairs) outperforms workflow-level abstractions on web tasks; PRAXIS showed this ~2024, but granularity is domain-conditional—workflows win in routine-heavy tasks, causal-rule memory in logic-heavy ones.
• Sub-task routine induction (hierarchical, mid-level) yields relative gains of 24.6% on Mind2Web and 51.1% on WebArena; Agent Workflow Memory ~2024 suggests finer-grained hierarchy beats monolithic workflows.
• Continuous memory consolidation follows an inverted-U curve: after over-consolidation, a frontier model failed 54% of previously-solved problems, through misgrouping, applicability stripping, and overfitting (~2025–2026).
• Quality, not storage capacity, is the bottleneck; unvetted capacity expansion worsens agent behavior (~2026).
• Adaptive topology (FluxMem-like approaches, ~2025–2026) allows memory to refine links and abstraction dynamically during execution, reaching SOTA by decoupling from fixed grain commitments.

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 Agent Workflow Memory (2024)
• arXiv:2605.28773 Rethinking Memory as Continuously Evolving Connectivity (2026)
• arXiv:2605.12978 Useful Memories Become Faulty When Continuously Updated by LLMs (2026)
• arXiv:2604.08224 Externalization in LLM Agents (2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For state-action vs. workflow granularity, verify whether newer harnesses (SDKs, caching strategies, multi-turn orchestration) have narrowed the web-domain advantage, or whether it persists. Separately, probe the inverted-U consolidation curve: has learned-on-the-fly consolidation (agents pruning their own memory) superseded the quality-vs.-storage framing? Check whether DeepAgent (2025-10) or newer reasoning agents have found orthogonal solutions (e.g., hierarchical reasoning steps replacing procedural memory).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months. Look for papers claiming fixed granularity suffices if memory quality is high, or arguing that larger context windows (or retrieval-augmented memory) make granularity moot.

(3) Propose 2 research questions assuming the regime has moved: (a) If adaptive topology now handles granularity, does procedure *linkage* (how memories relate) matter more than grain size? (b) Do multi-agent or hierarchical agent orchestrations decouple procedural granularity from individual agent performance?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
