INQUIRING LINE

What distinguishes working memory from strategic memory in agent task execution?

This reads the question as asking how the moment-to-moment state an agent holds while doing a task differs from the higher-level planning and lessons it carries across tasks — and the corpus suggests the line isn't a storage location but a difference in what the memory is *for*.


This explores the split between the transient state an agent uses to execute the step in front of it (working memory) and the accumulated planning knowledge it draws on to decide what to do (strategic memory). The most striking thing in the corpus is that this distinction shows up as two separate *learning phases*, not just two storage bins. Across eight models, RL training reliably moves through a first phase where getting the execution right is the bottleneck, followed by a second phase where strategic planning becomes the hard part. You can watch it happen, because the entropy on planning tokens rises while execution entropy settles down (see "Does RL training follow a predictable two-phase learning sequence?"). Working memory is what stabilizes first; strategic memory is what's still being figured out late.
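To make the entropy signal concrete, here is a minimal sketch of how one might track it. It assumes each generated token comes with its next-token probability distribution and a role label ("planning" or "execution"). The labels, the function names, and the toy numbers are all illustrative assumptions; the corpus does not say how tokens were tagged.

```python
import math
from collections import defaultdict


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy_by_role(steps):
    """Average entropy separately for each role label.

    `steps` is a list of (role, probs) pairs. The role labels are assumed
    to come from an external tagger that is not shown here.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for role, probs in steps:
        totals[role] += token_entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}


# Toy data: a peaked distribution for an execution token,
# a flat distribution for a planning token.
steps = [
    ("execution", [0.97, 0.01, 0.01, 0.01]),
    ("planning", [0.25, 0.25, 0.25, 0.25]),
]
result = mean_entropy_by_role(steps)
print(result)
```

Logged once per training checkpoint, the two averages would give the pair of curves described above: execution entropy flattening while planning entropy climbs.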

The same asymmetry appears in how agents are taught to *store* the two. SkillRL treats successful episodes as concrete demonstrations you can replay verbatim, and failed episodes as abstracted lessons. That is exactly the working-vs-strategic divide: the procedural trace gets kept literally and the strategic insight gets compressed into a rule (see "Should successful and failed episodes be processed differently?"). Folding everything through the same consolidation pipeline degrades performance; the two kinds of memory want different treatment because they answer different questions ("how did I do this" vs "what should I do differently").
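A rough sketch of that asymmetric consolidation, under loud assumptions: `Episode`, `SplitMemory`, and `abstract_lesson` are hypothetical names invented for illustration, not SkillRL's actual interface, and the string template stands in for whatever summarizer a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    actions: list
    success: bool
    failure_reason: str = ""


def abstract_lesson(ep):
    """Hypothetical stand-in for an LLM summarizer (assumption)."""
    return f"When doing '{ep.task}', avoid: {ep.failure_reason}"


@dataclass
class SplitMemory:
    demonstrations: list = field(default_factory=list)  # verbatim traces
    lessons: list = field(default_factory=list)         # compressed rules

    def consolidate(self, ep):
        if ep.success:
            # Keep the procedural trace literally so it can be replayed.
            self.demonstrations.append((ep.task, list(ep.actions)))
        else:
            # Drop the trace and keep only a short abstracted rule.
            self.lessons.append(abstract_lesson(ep))


memory = SplitMemory()
memory.consolidate(Episode("open door", ["find key", "unlock"], True))
memory.consolidate(Episode("open door", ["push"], False, "pushing a locked door"))
print(memory.demonstrations)
print(memory.lessons)
```

The point of the two lists is that they are read differently later: demonstrations are retrieved as examples to imitate, lessons as constraints on planning.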

If you want the cleanest structural map, RAISE decomposes agent working memory into four components at two granularities: dialogue-level (the whole conversation, a scratchpad) and turn-level (examples, the current task trajectory). It notes that each granularity has its own failure modes and update rules (see "How should agent memory split across time scales?"). A broader 2025 survey pushes back on naming any of this "short-term" vs "long-term" at all, arguing the phenomena are better described by *function* (factual, experiential, working), with the temporal feel emerging from how memory forms and gets retrieved rather than from a hard architectural wall (see "Can three axes replace the short-term long-term memory split?"). On that view, "strategic memory" is really the experiential/planning function, and "working memory" the active scratchpad function.
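The four-component layout can be sketched as a small data structure. This is an illustration only: the class name, the method names, and especially the update rule (turn-level fields rebuilt on every turn, dialogue-level fields persisting) are assumptions made for the example, not RAISE's published implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Dialogue-level: lives for the whole conversation.
    conversation_history: list = field(default_factory=list)
    scratchpad: dict = field(default_factory=dict)
    # Turn-level: rebuilt for each new query (assumed update rule).
    examples: list = field(default_factory=list)
    task_trajectory: list = field(default_factory=list)

    def start_turn(self, user_message, retrieved_examples):
        """Append to dialogue-level state and reset turn-level state."""
        self.conversation_history.append(user_message)
        self.examples = list(retrieved_examples)
        self.task_trajectory = []

    def record_step(self, step):
        """Log one action taken while working on the current turn."""
        self.task_trajectory.append(step)


mem = AgentMemory()
mem.start_turn("Find flats under 2000", ["example: rental query"])
mem.record_step("search(listings)")
mem.scratchpad["budget"] = 2000
mem.start_turn("Only two-bedroom ones", ["example: filter query"])
print(mem.conversation_history, mem.scratchpad, mem.task_trajectory)
```

Separating the fields this way makes the differing update rules visible: the scratchpad survives the second `start_turn`, while the trajectory does not.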

The lateral payoff: where strategic knowledge *lives* turns out to depend on the domain. Workflow-level memory wins in routine-rich tasks, causal-rule memory in environment-rich ones, and fine-grained state-action memory in spatially-rich web tasks. The right level of abstraction for your strategic layer is conditional on whether task variance comes from arguments, causal structure, or UI state (see "Does agent memory work better at one level of abstraction?"). And the boundary can blur. VOYAGER's externalized skill library is arguably strategic memory made executable: composable procedures that compound over time without the catastrophic forgetting of weight updates (see "Can agents learn new skills without forgetting old ones?"). AgentFly, meanwhile, shows an agent can improve its whole policy through memory operations alone, with no parameter changes (see "Can agents learn continuously from experience without updating weights?").
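Read as a design rule, the domain-conditional finding is a lookup from "where does the variance come from" to "which abstraction level to store". The sketch below encodes only that mapping; the enum, the dictionary, and the function name are hypothetical, and diagnosing the dominant variance source for a real task family is assumed to happen elsewhere.

```python
from enum import Enum


class VarianceSource(Enum):
    ARGUMENTS = "arguments"                # routine-rich tasks
    CAUSAL_STRUCTURE = "causal structure"  # environment-rich tasks
    UI_STATE = "ui state"                  # spatially-rich web tasks


# Mapping taken from the finding summarized above.
MEMORY_LEVEL = {
    VarianceSource.ARGUMENTS: "workflow",
    VarianceSource.CAUSAL_STRUCTURE: "causal-rule",
    VarianceSource.UI_STATE: "state-action",
}


def choose_memory_level(source):
    """Return the memory abstraction level suggested for a variance source."""
    return MEMORY_LEVEL[source]


level = choose_memory_level(VarianceSource.UI_STATE)
print(level)
```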

The thing worth taking away: the most reliable agents don't hold this distinction in the model's weights at all. They externalize state, skills, and protocols into a harness layer so the model isn't re-solving the same memory problems every turn (see "Where does agent reliability actually come from?"). Working vs strategic memory, on that reading, is less a property of cognition than a design decision about which burdens you push outside the model.
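A toy harness shows the shape of that externalization. Everything here is an assumption for illustration: `Harness`, `toy_model`, and the skill names are made up, and the model is a plain function standing in for an LLM call. What matters is that the function keeps nothing between calls; state, skills, and the protocol check all live in the harness.

```python
class Harness:
    """Holds state, skills, and the action protocol outside the model."""

    def __init__(self, model, skills, state):
        self.model = model    # stateless callable: (state, actions) -> action name
        self.skills = skills  # externalized procedures, keyed by name
        self.state = state    # externalized memory

    def step(self):
        action = self.model(dict(self.state), sorted(self.skills))
        # Protocol: the model may only answer with a known skill name.
        if action not in self.skills:
            raise ValueError(f"unknown action: {action!r}")
        self.state = self.skills[action](self.state)
        return action


def toy_model(state, actions):
    # Hypothetical stand-in for an LLM; it remembers nothing between calls.
    return "increment" if state["count"] < 2 else "stop"


skills = {
    "increment": lambda s: {**s, "count": s["count"] + 1},
    "stop": lambda s: s,
}

harness = Harness(toy_model, skills, {"count": 0})
actions = [harness.step() for _ in range(3)]
print(actions, harness.state)
```

Because the model only ever sees what the harness hands it, swapping the memory strategy means changing the harness and leaving the model untouched.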


Sources (8 notes)

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components organized at two granularities: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Can three axes replace the short-term long-term memory split?

A 2025 survey reframes agent memory along forms (token/parametric/latent), functions (factual/experiential/working), and dynamics (formation/evolution/retrieval), showing that short/long-term phenomena emerge from temporal patterns rather than architectural separation. This enables precise system comparison and replaces vague implementation-based claims.

Does agent memory work better at one level of abstraction?

Workflow-level memory wins in routine-rich domains, causal-rule memory in environment-rich domains, and state-action memory in spatially-rich web tasks. The optimal abstraction depends on whether task variance comes from arguments, causal structure, or fine-grained UI state.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher re-testing claims about working vs. strategic memory in LLM agents. The question remains: what distinguishes transient execution state from accumulated planning knowledge, and does that distinction hold as agent architectures evolve?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable benchmarks.
- RL training exhibits a two-phase dynamic: execution (working memory) stabilizes first; strategic planning becomes the bottleneck late (~2025).
- SkillRL and similar systems apply differential processing — concrete replay for procedural traces, abstraction for strategic rules (~2025).
- RAISE decomposes working memory into four components across dialogue/turn granularity; each has distinct failure modes (~2025).
- Agent memory now unifies better by *function* (factual, experiential, working) than by temporal bins; strategic memory is the experiential/planning function (~2025–2026).
- Memory granularity (workflow, causal-rule, state-action) is domain-conditional; the right abstraction depends on whether variance comes from arguments, causality, or UI state (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2507.16784 (2025-07, Beyond Context Limits)
- arXiv:2512.13564 (2025-12, Memory Survey: Forms, Functions, Dynamics)
- arXiv:2604.08224 (2026-04, Externalization Review)
- arXiv:2605.12978 (2026-05, Continuously Updated Memory Faults)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the two-phase RL finding, the differential storage claim, and the domain-conditional granularity thesis: have newer training regimes (scaling, synthetic data, multimodal grounding), memory architectures (vector DBs, hybrid retrieval-augmented execution), or agent orchestration tooling (memory caching, multi-agent role-splitting) since relaxed any bottleneck or unified the distinction? Where does the working/strategic split still appear empirically necessary?
(2) Surface the strongest work from the last ~6 months that CONTRADICTS the functional unification thesis or shows a case where the working/strategic boundary remains sharp and irreducible.
(3) Propose 2 research questions that assume the regime has shifted: (a) If externalization (harness-layer memory) now does the heavy lifting, what cognitive burden remains in the model's weights — and is *that* the true working memory? (b) Can a single unified memory system, trained end-to-end, recover the two-phase learning dynamics without explicit architectural separation, or is the split fundamental to learning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
