INQUIRING LINE

What drives the choice between storing raw episodes versus abstracted rules?

This explores the design trade-off in agent memory: when should a system keep concrete, full-detail records of what happened (raw episodes) versus distilling them into compact, reusable rules or summaries (abstracted rules)?


The corpus suggests the choice between keeping the full concrete record of what happened and distilling it into compact reusable rules is driven less by storage cost than by a surprising fact: models often don't trust their own abstractions. The sharpest finding is that LLM agents lean heavily on raw experience and quietly ignore condensed summaries (see "Why do LLM agents ignore condensed experience summaries?"). Across 10 models and 9 environments, perturbing the raw trace changed behavior significantly, while perturbing the summary did almost nothing. Three causes are given: compression strips the very details the model needed, models favor immediate context over retrieved information, and pretrained knowledge already covers the generic lessons a summary tends to capture. So the default bias toward 'abstract to save context' can be self-defeating: you pay to summarize, and the model reads past it.

But raw-everything doesn't scale either, which is where the most interesting answer in the collection lives: don't choose globally, choose per outcome. SkillRL treats successful episodes as concrete demonstrations and failures as abstracted lessons (see "Should successful and failed episodes be processed differently?"). The intuition is that a success is worth replaying move-for-move, because the exact path is the value, whereas a failure is mostly worth one transferable rule ('don't do this'), and keeping the full failed trajectory just burns context. This asymmetry mirrors how human experts remember, and it beats treating every episode the same way.
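The per-outcome policy described above can be sketched in a few lines. This is a minimal illustration of the idea only, not SkillRL's actual implementation: the `Episode` and `MemoryStore` names, the `lesson` field, and the assumption that a one-line lesson is produced by some separate reflection step are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    steps: list[str]   # full action trace, in order
    success: bool
    lesson: str = ""   # one-line takeaway; assumed to come from a separate reflection step


@dataclass
class MemoryStore:
    demonstrations: list[Episode] = field(default_factory=list)  # raw, replayable
    rules: list[str] = field(default_factory=list)               # abstracted, compact

    def add(self, episode: Episode) -> None:
        if episode.success:
            # Success: keep the whole trajectory, since the exact path is the value.
            self.demonstrations.append(episode)
        elif episode.lesson and episode.lesson not in self.rules:
            # Failure: keep only the transferable rule and drop the trace.
            self.rules.append(episode.lesson)

    def context_for(self, task: str) -> str:
        # Rules apply everywhere; demonstrations are filtered to the matching task.
        lines = [f"RULE: {rule}" for rule in self.rules]
        for episode in self.demonstrations:
            if episode.task == task:
                lines.append(f"DEMO ({episode.task}): " + " -> ".join(episode.steps))
        return "\n".join(lines)
```

With this layout a failed trajectory never reaches the prompt. Only its lesson does, which is where the context saving in the asymmetric scheme would come from.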

The risk on the abstraction side gets named directly by work on evolving context: compress too eagerly and you get 'brevity bias' and context collapse, where each rewrite quietly erases detail until the playbook is hollow (see "Can context playbooks prevent knowledge loss during iteration?"). The ACE framework's answer is to grow rules incrementally rather than rewrite-and-summarize, which is really a way of getting abstraction's compactness without paying compression's forgetting tax. A related instinct shows up in retrieval, where collapsing procedures into uniform chunks destroys the step-to-step structure that 'how-to' knowledge depends on; logic units keep the prerequisites and the ordering intact instead of flattening them (see "How do logic units preserve procedural coherence better than chunks?").
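To make the contrast between incremental growth and rewrite-and-summarize concrete, here is a small sketch. It is not the ACE implementation (which uses generation-reflection-curation loops); it only shows the append-only delta idea. The `Playbook` class, `apply_delta`, and the deliberately crude `rewrite_and_summarize` stand-in are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    entries: dict[str, str] = field(default_factory=dict)  # entry id -> rule text
    _next_id: int = 0

    def apply_delta(self, new_rules: list[str]) -> list[str]:
        """Append rules that are not already present; never touch existing ones."""
        added = []
        existing = {text.lower() for text in self.entries.values()}
        for rule in new_rules:
            if rule.lower() in existing:
                continue  # skip exact duplicates (case-insensitive)
            entry_id = f"r{self._next_id}"
            self._next_id += 1
            self.entries[entry_id] = rule
            existing.add(rule.lower())
            added.append(entry_id)
        return added


def rewrite_and_summarize(rules: list[str], max_rules: int) -> list[str]:
    """Stand-in for a lossy full rewrite: keeps only the shortest rules."""
    return sorted(rules, key=len)[:max_rules]
```

The delta path can only add entries, so earlier detail survives every iteration. The rewrite path favors brevity by construction, which is one simple way to picture how repeated rewrites hollow out a playbook.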

The deepest reframing is that 'raw vs. abstracted' is a special case of matching representation to task. StructRAG routes each query to whichever structure fits its cognitive demands (a table, a graph, an algorithm, a catalogue, or plain chunks) rather than forcing one format on everything (see "Can routing queries to task-matched structures improve RAG reasoning?"). Read that way, the real driver isn't a philosophical preference for concrete or compact memory; it's whether the downstream task needs to *replay a specific path* (favor raw) or *recognize a recurring pattern* (favor a rule). The systems that win are the ones that keep both and decide case by case.
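The routing idea can be illustrated with a toy dispatcher. StructRAG's router is a trained model, so the keyword heuristic below is only a placeholder that shows the shape of the decision. The cue words and the pairing of cues with structures are assumptions made for this sketch, not taken from the paper.

```python
from enum import Enum


class Structure(Enum):
    TABLE = "table"
    GRAPH = "graph"
    ALGORITHM = "algorithm"
    CATALOGUE = "catalogue"
    CHUNK = "chunk"


# Hypothetical cue words; a real router would be learned, not hand-written.
CUES = {
    Structure.TABLE: ("compare", "versus", "difference between"),
    Structure.GRAPH: ("related to", "depends on", "connected"),
    Structure.ALGORITHM: ("how do i", "steps to", "procedure"),
    Structure.CATALOGUE: ("summarize", "overview", "list all"),
}


def route(query: str) -> Structure:
    """Pick the first structure whose cue appears in the query; default to chunks."""
    q = query.lower()
    for structure, cues in CUES.items():
        if any(cue in q for cue in cues):
            return structure
    return Structure.CHUNK
```

For example, `route("Compare the revenue of A and B")` selects `Structure.TABLE`, while a query with no cue falls back to `Structure.CHUNK`. Swapping the heuristic for a learned classifier would leave the calling code unchanged.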

What you might not have expected to learn: the binding constraint here is often the model's own reading behavior, not disk or context budget. A summary that's technically correct but loses the load-bearing specifics will be ignored even when it's retrieved — so the question 'how much do we abstract?' is really 'how much can we abstract before the model stops believing it?'


Sources (5 notes)

Why do LLM agents ignore condensed experience summaries?

Across 10 LLMs and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

How do logic units preserve procedural coherence better than chunks?

THREAD replaces chunks with four-part logic units—prerequisite, header, body, linker—enabling dynamic multi-step retrieval for how-to questions. Linkers explicitly navigate between steps and branches, addressing both the semantic-vs-task-relevance gap in embeddings and the sequential dependency loss in chunk-based RAG.
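A four-part logic unit and linker-driven navigation might look roughly like the sketch below. The field names follow the note above, but their exact semantics, the `walk` helper, and the idea of keying linkers by an observed outcome are assumptions for illustration. THREAD's actual schema may differ.

```python
from dataclasses import dataclass, field


@dataclass
class LogicUnit:
    header: str                      # what this step is about
    body: str                        # the instructions themselves
    prerequisite: str = ""           # what must hold before this step applies
    linker: dict[str, str] = field(default_factory=dict)  # outcome -> header of next unit


def walk(units: dict[str, LogicUnit], start: str, outcomes: list[str]) -> list[str]:
    """Follow linkers from `start`, consuming one observed outcome per step."""
    path = [start]
    current = units[start]
    for outcome in outcomes:
        next_header = current.linker.get(outcome)
        if next_header is None or next_header not in units:
            break  # no link for this outcome: stop rather than guess
        path.append(next_header)
        current = units[next_header]
    return path
```

Because each unit carries its own links, retrieval can proceed step by step along the branch that matches what actually happened, instead of fetching whichever chunks happen to be semantically closest to the question.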

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about agent memory design in LLM systems. The question remains: What drives the choice between storing raw episodes versus abstracted rules?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. A library of agent and RAG papers reports:
• LLM agents systematically ignore condensed summaries while pivoting to raw experience; perturbations to abstracts cause near-zero behavior shift, while raw traces shift behavior measurably (~2026).
• Success episodes benefit from concrete replay; failure episodes compress to single transferable rules, cutting context burn — this asymmetry outperforms uniform storage (~2025).
• Aggressive compression triggers 'brevity bias' and context collapse; incremental rule growth (ACE framework) preserves detail without raw-episode cost (~2025).
• Retrieval systems that flatten procedures into uniform chunks destroy step-to-step structure; logic units (prerequisite, header, body, linker) preserve prerequisites and step ordering (~2024).
• Task-appropriate routing (table, graph, algorithm, or chunks per query) beats forcing one format; the real driver is whether the task replays a specific path or recognizes a pattern (~2024).

Anchor papers (verify; mind their dates):
• arXiv:2601.22436 (2026-01): Large Language Model Agents Are Not Always Faithful Self-Evolvers
• arXiv:2510.04618 (2025-10): Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
• arXiv:2410.08815 (2024-10): StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information
• arXiv:2406.13372 (2024-06): Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering

Your task:
(1) RE-TEST EACH CONSTRAINT. For claims about model inattention to abstracts, brevity bias, and task-routing benefits: has newer tooling (e.g., improved summarization, multi-modal context, dynamic retrieval harnesses, or agent scaffolding SDKs) since relaxed these limits? Does the 2026 faithfulness critique suggest the abstraction-ignoring pattern is itself unstable or artefactual? Separate the durable insight (task structure should match representation) from perishable limitations (current models trust raw over abstract).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any papers showing models *do* learn to trust compressed summaries, or where raw-episode storage becomes a genuine bottleneck despite the library's claim it isn't.
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Can learned routing (meta-learning which representation type per task) replace heuristic matching? (b) Does fine-tuning on abstraction-following tasks teach models to *rely* on rules, inverting the current bias?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
