INQUIRING LINE

How do strategy-level abstractions differ from storing raw task workflows?

This explores the trade-off between memory that stores *generalized strategies* (reusable, abstracted routines) and memory that stores *concrete task recordings* (the literal click-by-click steps), and asks which one actually helps an agent reuse what it learned.


The corpus stages this as a genuine fight, not a settled question. On one side, Agent Workflow Memory argues that the win comes from *abstraction*: it extracts sub-task routines at finer granularity than whole tasks, strips out example-specific values (the particular URL, the particular form field), and then compounds those routines hierarchically. The reported relative gains are 24.6% on Mind2Web and 51.1% on WebArena, and they get *larger* as the gap between training and test tasks widens (see the note "Can agents learn reusable sub-task routines from past experience?"). The whole point is that throwing away the specifics is what makes the memory transfer.
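To make "stripping out example-specific values" concrete, here is a minimal sketch of turning one recorded workflow into a reusable template. It is not Agent Workflow Memory's actual induction procedure; the `Step` type, the `abstract_workflow` and `instantiate` helpers, and the example URLs are all hypothetical, and the sketch assumes the recorded strings contain no literal braces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One recorded action (hypothetical schema, for illustration only)."""
    action: str      # e.g. "goto", "type", "click"
    target: str      # URL or element selector the action applies to
    value: str = ""  # text typed, if any


def abstract_workflow(steps, bindings):
    """Replace example-specific literals with named placeholders.

    `bindings` maps a literal seen in the recording to a slot name,
    e.g. {"alice@example.com": "email"}.
    """
    def generalize(text):
        for literal, slot in bindings.items():
            text = text.replace(literal, "{" + slot + "}")
        return text

    return [Step(s.action, generalize(s.target), generalize(s.value)) for s in steps]


def instantiate(template, **slots):
    """Fill a template's placeholders with values for a new task."""
    return [
        Step(s.action, s.target.format(**slots), s.value.format(**slots))
        for s in template
    ]


# A raw recording: every value is tied to one site and one user.
recorded = [
    Step("goto", "https://shop.example.com/login"),
    Step("type", "#email", "alice@example.com"),
    Step("click", "#submit"),
]

# The abstracted routine keeps the structure and drops the specifics.
template = abstract_workflow(
    recorded,
    {"https://shop.example.com": "base_url", "alice@example.com": "email"},
)

# The same routine replayed against a different site and user.
replayed = instantiate(
    template, base_url="https://docs.example.org", email="bob@example.org"
)
```

The raw recording can only be replayed on the site it came from; the template can be re-bound, but it also carries forward selectors such as `#email` that may not exist elsewhere, which is the weakness the next paragraph picks up.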

But the corpus also has a sharp dissent. PRAXIS finds the opposite for web agents: indexing procedures by concrete environment state and local action pairs, keeping the click-by-click specifics, beats higher-level workflow abstractions, which it argues *lose* exactly the detail you need to act reliably (see the note "Does state-indexed memory outperform high-level workflow memory for web agents?"). So the difference between strategy-level and raw-workflow memory isn't 'one is better'; it's a bet about how much your future tasks will resemble your past ones. Abstraction pays off when tasks differ; concrete state-indexed recall pays off when reliable execution in a familiar environment matters more than generalizing.
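For contrast, a state-indexed memory can be sketched as a lookup keyed by what the agent currently sees rather than by task. This is an illustration of the general idea only, not PRAXIS's implementation: the `StateIndexedMemory` class, the fingerprint made of a URL path plus visible element ids, and the Jaccard-overlap threshold are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass, field


@dataclass
class StateIndexedMemory:
    """Maps an environment-state fingerprint to the action that worked there.

    Hypothetical design: a fingerprint is (url_path, frozenset of element ids).
    """
    entries: dict = field(default_factory=dict)

    @staticmethod
    def fingerprint(url_path, visible_ids):
        return (url_path, frozenset(visible_ids))

    def record(self, url_path, visible_ids, action):
        self.entries[self.fingerprint(url_path, visible_ids)] = action

    def recall(self, url_path, visible_ids, min_overlap=0.5):
        """Return the stored action whose state best matches, or None."""
        query = frozenset(visible_ids)
        best_action, best_score = None, 0.0
        for (path, ids), action in self.entries.items():
            if path != url_path:
                continue  # only compare states on the same page
            union = ids | query
            # Jaccard overlap between stored and observed element sets.
            score = len(ids & query) / len(union) if union else 1.0
            if score > best_score:
                best_action, best_score = action, score
        return best_action if best_score >= min_overlap else None


memory = StateIndexedMemory()
memory.record("/login", ["#email", "#password", "#submit"], ("click", "#submit"))

# Same page with one extra element: overlap is 3/4, so the action is recalled.
familiar = memory.recall("/login", ["#email", "#password", "#submit", "#banner"])

# A page never seen before: nothing is recalled.
unfamiliar = memory.recall("/checkout", ["#email"])
```

The design choice is visible in the failure mode: recall is precise where the environment matches and silent where it does not, the mirror image of a template that always applies but may apply wrongly.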

The reason both can be right shows up in a third note: when you separate the planner from the executor, *decomposition ability transfers across domains but solving ability does not* (see the note "Does separating planning from execution improve reasoning accuracy?"). That's the cleanest explanation of the whole tension. Strategy-level abstractions capture the part of skill that generalizes (how to break a problem down); raw workflows capture the part that doesn't (the exact actions that worked here). Store a skill at the wrong level and you either abstract away details that execution depends on or memorize steps that won't replay anywhere else.
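One way to picture storing each part at its own level is a two-layer memory: a shared plan of sub-goals plus per-environment action tables. The sketch below is an assumption-laden illustration rather than a method from any of the cited notes; `PLAN_LIBRARY`, `EXECUTORS`, the `expand` helper, and the shop names and selectors are all made up.

```python
# Decomposition layer: domain-independent sub-goals (the part that transfers).
PLAN_LIBRARY = {
    "buy_item": ["authenticate", "find_item", "checkout"],
}

# Execution layer: concrete actions that only hold in one environment.
# Both environments and all selectors are hypothetical.
EXECUTORS = {
    "shop_a": {
        "authenticate": [("click", "#login")],
        "find_item": [("type", "#search"), ("click", "#result-0")],
        "checkout": [("click", "#buy-now")],
    },
    "shop_b": {
        "authenticate": [("click", "a.sign-in")],
        "find_item": [("click", "nav .catalog"), ("click", ".item:first-child")],
        "checkout": [("click", "button.pay")],
    },
}


def expand(task, environment):
    """Combine the shared plan with environment-specific actions."""
    actions = []
    for sub_goal in PLAN_LIBRARY[task]:
        actions.extend(EXECUTORS[environment][sub_goal])
    return actions


actions_a = expand("buy_item", "shop_a")
actions_b = expand("buy_item", "shop_b")
```

Both expansions follow the same three sub-goals, yet they share no concrete action, which is the split between what transfers and what does not.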

There's a further wrinkle worth knowing: abstractions don't just compress, they *change how an agent searches*. RLAD shows that spending compute on generating diverse abstractions produces structured breadth-first exploration and prevents 'underthinking', the failure where a model commits to one path too early (see the note "Can abstractions guide exploration better than depth alone?"). A raw stored workflow can't do that; it's a single rail. So a strategy abstraction is also a tool for considering alternatives, not just a smaller way to remember one.

And at the far end, FlowReasoner abandons stored workflows entirely: instead of reusing fixed task-level templates, it generates a fresh multi-agent architecture per query (see the note "Can AI systems design unique multi-agent workflows per individual query?"). That reframes the original question: the spectrum runs from raw recorded workflows, to abstracted reusable strategies, to no stored workflow at all but a *strategy for producing one on demand*. The deeper you go, the more 'memory' stops being storage and starts being a generative skill.


Sources (5 notes)

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Does state-indexed memory outperform high-level workflow memory for web agents?

PRAXIS shows that indexing procedures by environment state and local action pairs yields consistent accuracy and reliability gains across VLM backbones on the REAL benchmark, compared to higher-level workflow abstractions that lose click-by-click specifics.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can AI systems design unique multi-agent workflows per individual query?

FlowReasoner demonstrates that meta-agents trained with reinforcement learning and external execution feedback can generate unique multi-agent architectures for each user query, optimizing across performance, complexity, and efficiency—moving beyond fixed task-level workflow templates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **How do strategy-level abstractions differ from storing raw task workflows in agent memory, and which regime suits which deployment?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025, tracking a genuine tension rather than consensus:
• Abstraction wins on generalization: Agent Workflow Memory (2024-09) shows 24–51% gains extracting reusable sub-task routines, larger as train–test task gap widens. The payoff: stripping example-specific values.
• Concrete state-indexing wins on reliability: PRAXIS (state-dependent procedural memory) finds click-by-click specifics beat higher-level workflow abstractions for web agents in familiar environments.
• Decomposition transfers; execution doesn't: Planning ability (how to break problems) generalizes across domains; solving ability (exact actions) does not (2024 finding).
• Abstractions enable breadth-first search: RLAD (2025) shows diverse abstractions prevent 'underthinking' — raw workflows cannot explore alternatives.
• Meta-generation replaces storage: FlowReasoner (2025-04) bypasses fixed templates, generating multi-agent architectures per query — shifting 'memory' from storage to generative skill.

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 Agent Workflow Memory (2024-09)
• arXiv:2407.11511 Reasoning with Large Language Models, a Survey (2024-07)
• arXiv:2504.15257 FlowReasoner: Reinforcing Query-Level Meta-Agents (2025-04)
• arXiv:2511.22074 Real-Time Procedural Learning From Experience for AI Agents (2025-11)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For abstraction vs. raw workflow, judge whether new LLM capability (reasoning models, chain-of-thought scaling), training methods (RL from agent trajectories, in-context few-shot), or orchestration (memory systems, retrieval augmentation, multi-agent frameworks) have since shifted the tradeoff. Does the generalization–reliability tension still hold, or do newer systems dissolve it? Flag where the constraint appears to hold and where it's broken.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has any paper from mid-to-late 2025 shown that one regime dominates the other, or that a hybrid regime (e.g., abstract planning + concrete executor) outperforms both?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., 'Can learned routing policies automatically choose abstraction level per task?' or 'Do scaling laws favor abstraction or memorization in agent memory?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
