INQUIRING LINE

Why do agents ignore condensed experience in favor of raw data?

This explores a counterintuitive finding: agents lean on raw interaction logs and largely disregard the tidy summaries built from them — and asks why that happens and what it implies for how we build agent memory.


The sharpest evidence comes from a study across 10 models and 9 environments (Why do LLM agents ignore condensed experience summaries?): scrambling an agent's raw experience changed its behavior substantially, while scrambling the condensed summaries barely registered. Three forces drive the asymmetry: summaries quietly drop the details that actually mattered, models weight whatever is sitting in immediate context over anything retrieved, and pretrained knowledge lets the model get by without external experience it could otherwise lean on. In other words, the problem isn't that agents can't read summaries; it's that most summaries throw away exactly the load-bearing specifics, and the model already has cheaper places to look.
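
The comparison behind that finding can be pictured as a small sensitivity check. The sketch below is illustrative only and is not the study's harness: `toy_agent`, `shuffle_raw`, `shuffle_summary`, and `sensitivity` are hypothetical names, shuffling is just one assumed form of scrambling, and the toy agent ignores summaries by construction, so it shows how the measurement works rather than reproducing the result.

```python
import random


def shuffle_raw(raw, summary, rng):
    """Scramble only the raw log (assumed perturbation: a shuffle)."""
    rng.shuffle(raw)
    return raw, summary


def shuffle_summary(raw, summary, rng):
    """Scramble only the condensed summary (assumed perturbation: a shuffle)."""
    rng.shuffle(summary)
    return raw, summary


def sensitivity(agent, raw, summary, perturb, trials=50, seed=0):
    """Fraction of trials in which perturbed memory changes the agent's action."""
    rng = random.Random(seed)
    baseline = agent(raw, summary)
    changed = 0
    for _ in range(trials):
        # Perturb copies so the original memory stays intact between trials.
        r, s = perturb(list(raw), list(summary), rng)
        if agent(r, s) != baseline:
            changed += 1
    return changed / trials


def toy_agent(raw, summary):
    """Hypothetical stand-in: acts on the latest raw step, never reads the summary."""
    return f"act_on:{raw[-1]}"


# Made-up memory contents, for illustration only.
RAW = ["open door", "take key", "unlock chest", "read note", "go north"]
SUMMARY = ["explored room", "found items"]

raw_sensitivity = sensitivity(toy_agent, RAW, SUMMARY, shuffle_raw)
summary_sensitivity = sensitivity(toy_agent, RAW, SUMMARY, shuffle_summary)
```

With a real agent in place of `toy_agent`, a large gap between the two numbers would indicate that the agent is acting on the raw log and routing around the summary.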

The corpus suggests the failure is in *how* compression is usually done, not in compression itself. When consolidation is naive (a generic 'summarize the history' pass), it strips the concrete state an agent needs to act. But when an agent folds its own memory into structured schemas (episodic, working, tool) on its own initiative, it cuts token cost without the degradation that plagues lazy summarization (Can agents compress their own memory without losing critical details?). The same lesson shows up from the opposite direction: Reflexion deliberately keeps its self-diagnoses *uncompressed*, because a detailed verbal reflection, grounded in unambiguous success/failure feedback, is precisely what makes the memory usable next episode (Can agents learn from failure without updating their weights?). Detail is the active ingredient; compression that removes it removes the value.
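
To make the contrast with a flat summary concrete, here is a minimal sketch of folding a history into the three schemas named above. It is not DeepAgent's implementation: `Step`, `FoldedMemory`, `fold`, the `kind` labels, and the routing rules are all assumptions chosen for illustration. The point is only that repetitive tool calls collapse into counts while failures, observations, and the current goal keep their specifics.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    kind: str                    # hypothetical labels: "goal", "tool_call", "observation"
    text: str
    tool: Optional[str] = None   # set only for tool calls
    ok: bool = True


@dataclass
class FoldedMemory:
    episodic: List[str] = field(default_factory=list)               # concrete events, verbatim
    working: Dict[str, str] = field(default_factory=dict)           # current task state
    tool: Dict[str, Dict[str, int]] = field(default_factory=dict)   # per-tool outcome counts


def fold(steps):
    """Fold a raw step list into structured memory (illustrative rules only)."""
    mem = FoldedMemory()
    for step in steps:
        if step.kind == "tool_call" and step.tool is not None:
            stats = mem.tool.setdefault(step.tool, {"ok": 0, "failed": 0})
            stats["ok" if step.ok else "failed"] += 1
            if not step.ok:
                # Failures keep their details; successful calls are only counted.
                mem.episodic.append(f"{step.tool} failed: {step.text}")
        elif step.kind == "goal":
            mem.working["current_goal"] = step.text   # only the latest goal survives
        else:
            mem.episodic.append(step.text)
    return mem


# Made-up history, for illustration only.
STEPS = [
    Step("goal", "find the API rate limit"),
    Step("tool_call", "query='rate limit'", tool="search"),
    Step("tool_call", "HTTP 429 after 3 retries", tool="search", ok=False),
    Step("observation", "docs say limit is 60 requests per minute"),
    Step("tool_call", "query='burst limit'", tool="search"),
    Step("goal", "confirm the burst limit"),
]
memory = fold(STEPS)
```

A generic summary of the same history might say only "searched the docs a few times", which drops the 429 error and the 60-per-minute figure that the next action depends on.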

There's also a fidelity-matching angle worth knowing. One line of work trains an external manager to prune context for a frozen agent, and finds that the right amount of compression depends on the agent: strong agents benefit from high-fidelity preservation, while weak agents need aggressive pruning to stay reliable (Can external managers compress context better than frozen agents?). That reframes 'agents ignore condensed experience' as a calibration problem: a fixed summary handed to a capable model is simply lower-fidelity than what it could have used, so it routes around it.
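
The calibration idea can be sketched with a deliberately simple rule. The work cited above uses an RL-trained manager; the code below swaps in a fixed linear mapping from agent strength to keep ratio, purely as an illustration. `prune_context`, `agent_strength`, `min_keep`, `max_keep`, and the relevance scores are all hypothetical.

```python
def prune_context(items, agent_strength, min_keep=0.2, max_keep=0.9):
    """Keep more context for stronger agents, less for weaker ones.

    items: list of (relevance_score, text) pairs; scores are assumed given.
    agent_strength: assumed scalar in [0, 1], where 1 is the most capable agent.
    """
    if not 0.0 <= agent_strength <= 1.0:
        raise ValueError("agent_strength must be in [0, 1]")
    # Linear interpolation between aggressive pruning and high-fidelity preservation.
    keep_ratio = min_keep + (max_keep - min_keep) * agent_strength
    k = max(1, round(len(items) * keep_ratio))
    # Pick the k highest-scoring items, then restore their original order.
    top = sorted(range(len(items)), key=lambda i: items[i][0], reverse=True)[:k]
    return [items[i][1] for i in sorted(top)]


# Made-up context items with made-up relevance scores.
ITEMS = [(i / 10, f"m{i}") for i in range(10)]
weak_view = prune_context(ITEMS, agent_strength=0.0)
strong_view = prune_context(ITEMS, agent_strength=1.0)
```

The same history yields a short, heavily pruned view for a weak agent and a nearly complete view for a strong one, which is the sense in which a single fixed summary cannot fit both.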

A more radical reading is that the retrieve-a-summary step is itself the weak link. Instead of storing a pre-digested summary and pulling it back, some systems reconstruct memory on demand by walking a graph and pruning paths as evidence accumulates, which beats fixed retrieve-then-reason pipelines by up to 23% on reasoning tasks while costing less in tokens and runtime (Can agents reconstruct memory on demand instead of retrieving it?). And memory-as-policy approaches like AgentFly improve behavior entirely through structured memory operations over cases, subtasks, and tools rather than a flat summary blob, which is closer to how agents actually seem willing to use stored experience (Can agents learn continuously from experience without updating weights?).
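
As a rough picture of reconstruct-on-demand, the sketch below walks a small memory graph and drops any branch that adds no evidence still missing for the query. It is not MRAgent's algorithm: the greedy term-coverage rule, the function names, and the example graph are assumptions made for illustration.

```python
def words(text):
    """Lowercased whitespace tokens of a note."""
    return set(text.lower().split())


def reconstruct(graph, notes, start, query_terms, max_nodes=5):
    """Greedy walk that recalls notes until the query terms are covered.

    graph: node id -> list of neighbor ids; notes: node id -> note text.
    """
    uncovered = {t.lower() for t in query_terms}
    seen = {start}
    frontier = []
    recalled = []
    node = start
    while node is not None and len(recalled) < max_nodes:
        recalled.append(notes[node])
        uncovered -= words(notes[node])   # evidence accumulates here
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
        # Prune: drop branches that add nothing still missing.
        frontier = [n for n in frontier if uncovered & words(notes[n])]
        # Follow the branch that covers the most missing terms, if any remain.
        node = max(frontier, key=lambda n: len(uncovered & words(notes[n])), default=None)
        if node is not None:
            frontier.remove(node)
    return recalled


# Made-up memory graph, for illustration only.
NOTES = {
    "a": "deploy failed on staging",
    "b": "staging uses port 8080",
    "c": "lunch menu for friday",
    "d": "port 8080 blocked by firewall",
}
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
recalled = reconstruct(GRAPH, NOTES, "a", ["staging", "port", "firewall"])
```

Nothing here is summarized in advance: the notes stay verbatim, and the walk decides at query time which of them are worth pulling into context.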

The thing you didn't know you wanted to know: this connects to where agent reliability comes from at all. The argument that reliability lives in externalized memory, skills, and protocols (Where does agent reliability actually come from?) only holds if what you externalize keeps its detail and structure. A condensed-experience summary is the failure case of that thesis: it is externalization that discards the very specifics it was supposed to preserve, so the model rationally ignores it and falls back on raw context. The fix isn't 'summarize harder'; it is to structure the memory so that compression and fidelity stop fighting each other.


Sources (7 notes)

Why do LLM agents ignore condensed experience summaries?

Across 10 LLMs and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Can external managers compress context better than frozen agents?

An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.

Can agents reconstruct memory on demand instead of retrieving it?

MRAgent achieves up to 23% gains on reasoning tasks by reconstructing memory through active graph traversal that prunes paths based on accumulated evidence, while reducing token and runtime cost compared to fixed-retrieval pipelines.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
