INQUIRING LINE

How much actionable detail does condensation strip from raw experience?

This explores what gets lost when an agent's full, raw experience is boiled down into summaries or memory — and whether that lost material is the part that actually changes behavior.


On what condensation strips out when raw experience gets compressed into summaries, the corpus has a sharp, almost uncomfortable answer: it strips out exactly the part that drives action. In a study across 10 models and 9 environments, perturbing an agent's raw experience changed its behavior significantly, while perturbing the condensed summary barely registered [Why do LLM agents ignore condensed experience summaries?]. The summaries weren't just shorter: they had quietly dropped the details specific enough to act on, and the model leaned on immediate context and pretrained priors instead. Condensation didn't compress the signal; it deleted it.

Why does this happen? One clue comes from how feedback decomposes. Natural experience carries two separate things: an evaluative signal (how well did that go?) and a directive one (what specifically should change?) [Can scalar rewards capture all the information in agent feedback?]. Summarization tends to preserve the evaluative gist ("this approach worked") while discarding the directive specifics that tell you what to do differently next time. The actionable detail is the directive part, and it's the first casualty of abstraction. The same logic shows up in memory consolidation, which follows an inverted-U: a little helps, but as experience piles up, LLM-consolidated memory starts failing problems it had already solved (54% of them for GPT-5.4 in one case) through misgrouping, "applicability stripping," and overfitting [Does agent memory degrade when continuously consolidated?]. Applicability stripping names condensation's core failure directly: the summary keeps the conclusion but loses the conditions under which it applies.
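To make applicability stripping concrete, here is a minimal Python sketch. The `Lesson` record, the `gist` function, and the `applies` check are all hypothetical names invented for illustration; none of them come from the cited notes. The sketch assumes a lesson carries an evaluative conclusion, a directive, and the conditions under which it holds, and that naive condensation keeps only the conclusion.

```python
from dataclasses import dataclass, field


@dataclass
class Lesson:
    """Hypothetical record of one piece of experience."""
    conclusion: str                                   # evaluative gist
    directive: str                                    # what to do differently
    conditions: dict = field(default_factory=dict)    # when the lesson applies


def gist(lesson):
    """Naive condensation: keep the conclusion, drop directive and conditions."""
    return Lesson(conclusion=lesson.conclusion, directive="", conditions={})


def applies(lesson, situation):
    """True when every stored condition matches the current situation."""
    return all(situation.get(key) == value
               for key, value in lesson.conditions.items())


# Illustrative values only.
full = Lesson(
    conclusion="retrying worked",
    directive="retry after a 2 second backoff",
    conditions={"error": "timeout"},
)
condensed = gist(full)

situation = {"error": "auth_failure"}
print(applies(full, situation))       # False: the condition rules it out
print(applies(condensed, situation))  # True: no conditions left to check
```

Under these assumptions the condensed lesson still says "retrying worked," but it now fires in a situation the original would have excluded, and it no longer says what to do.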

But compression isn't doomed, and this is the part you might not expect. The damage seems to come from *naive* condensation, not condensation itself. A reasoning model's raw thinking trace, used as-is, turns out to be a better context compressor than most purpose-built methods, because the act of reasoning already selects what matters [Can a reasoning model's thinking trace compress context effectively?]. Push further and you can *train* compression to keep the actionable parts: reward-driven training that ties compression rate to downstream task quality produces compact traces that beat competitors by 17–23% F1 at 4x and 8x compression [Can thinking traces be made reliably budget-controllable?]. The difference is that the objective explicitly punishes throwing away detail that mattered.
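One way to picture an objective that couples compression rate to downstream quality is the sketch below. It is an assumption-laden illustration, not the reward used in the cited work: the function name `compression_reward`, the `target_ratio` and `quality_floor` parameters, and the specific formula are all invented here.

```python
def compression_reward(original_tokens, compressed_tokens, task_score,
                       target_ratio=4.0, quality_floor=0.5):
    """Hypothetical reward in which task quality gates any credit for brevity.

    task_score is assumed to lie in [0, 1] (for example, an F1 score).
    """
    if original_tokens <= 0 or compressed_tokens <= 0:
        raise ValueError("token counts must be positive")

    ratio = original_tokens / compressed_tokens
    # Credit for compression grows with the ratio and is capped at the target.
    brevity = min(ratio / target_ratio, 1.0)

    if task_score < quality_floor:
        # The trace dropped detail the task needed: being short earns nothing,
        # and the reward goes negative.
        return task_score - 1.0
    return task_score * brevity


# Illustrative values only.
good_and_short = compression_reward(1000, 250, task_score=0.9)
good_but_long = compression_reward(1000, 500, task_score=0.9)
short_but_broken = compression_reward(1000, 100, task_score=0.2)
```

With these made-up numbers, the short trace that still solves the task scores highest, the longer one scores lower, and the aggressively compressed trace that breaks the task is penalized despite its 10x ratio.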

The design lesson running across these is about *what* you condense, not *how much*. Step-level confidence filtering catches reasoning breakdowns that whole-trace averaging smooths over [Does step-level confidence outperform global averaging for trace filtering?]: granularity preserves the failure signal that aggregation erases. And DeepAgent's autonomous memory folding avoids the degradation that plagues other consolidation by sorting interactions into structured episodic, working, and tool memory schemas rather than mashing them into one prose summary [Can agents compress their own memory without losing critical details?]. So the answer to "how much does condensation strip?" is: nearly all of the actionable detail, *if* you condense by summarizing toward the gist, but very little if the condensation is structured, reward-grounded, or done at the granularity where the actionable signal actually lives.
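The contrast between whole-trace averaging and step-level filtering can be shown with a small Python sketch. The scoring functions, the window size, the threshold, and the confidence values are all assumptions chosen for illustration; the cited note does not specify this exact procedure.

```python
from statistics import mean


def global_confidence(step_confidences):
    """Whole-trace score: the average over every step."""
    return mean(step_confidences)


def weakest_window_confidence(step_confidences, window=2):
    """Step-level score: the lowest average over any run of `window` steps."""
    if len(step_confidences) <= window:
        return mean(step_confidences)
    return min(
        mean(step_confidences[i:i + window])
        for i in range(len(step_confidences) - window + 1)
    )


THRESHOLD = 0.6  # assumed cut-off for keeping a trace

# Hypothetical per-step confidences (illustrative numbers only).
steady = [0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
broken = [0.95, 0.95, 0.3, 0.3, 0.95, 0.95]   # collapses mid-trace

for name, trace in (("steady", steady), ("broken", broken)):
    keep_global = global_confidence(trace) >= THRESHOLD
    keep_local = weakest_window_confidence(trace) >= THRESHOLD
    print(name, keep_global, keep_local)
```

In this toy setup the broken trace passes the global average because its confident steps outweigh the collapse, while the windowed score exposes the two weak steps and rejects it.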


Sources (7 notes)

Why do LLM agents ignore condensed experience summaries?

Across 10 LLM models and 9 environments, perturbing raw experience changed agent behavior significantly, while altering condensed experience had minimal effect. Three causes drive this asymmetry: summaries lose critical details, models favor immediate context over retrieved information, and pretrained knowledge reduces reliance on external experience.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Does agent memory degrade when continuously consolidated?

LLM-consolidated textual memory degrades as experience accumulates, eventually performing worse than episodic-only retention. GPT-5.4 failed 54% of previously-solved problems after consolidation, with three mechanisms identified: misgrouping, applicability stripping, and overfitting on narrow streams.

Can a reasoning model's thinking trace compress context effectively?

A reasoning model's raw thinking trace, used directly as shortened context, outperforms most dedicated compression methods without requiring specialized modules or compression-specific training. The mechanism that enables reasoning also produces usable input compression.

Can thinking traces be made reliably budget-controllable?

Reward-driven training that couples compression rate to downstream task quality elicits compact, controllable traces. At 4x and 8x compression, this approach beats competitors by 17–23% F1 and transfers across models.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about what condenses away during summary compression in agentic LLM workflows — specifically, whether actionable detail (directive signals, applicability conditions, granular failure markers) survives modern condensation, reward-driven optimization, or structured memory folding.

What a curated library found — and when (2024–2026, dated claims, not current truth):
• Raw experience perturbation shifted agent behavior significantly; condensed summaries barely registered (2026). Models leaned on raw context + priors instead, suggesting summaries had stripped directive specifics needed for action.
• Feedback decomposes into evaluative ("worked well") and directive ("change this specifically") signals; summarization preserves evaluation, discards directives—the actionable part (2026).
• LLM-consolidated memory follows inverted-U: beyond a threshold, models fail 54% of previously solved problems via "applicability stripping," losing conditions for when conclusions apply (2025).
• Reasoning traces compress naturally by selecting what matters; reward-driven optimization (tying compression rate to downstream task quality) beats competitors by 17–23% F1 at 4x and 8x compression (2025).
• Step-level confidence filtering preserves failure signals that whole-trace averaging erases; structured episodic/working/tool schemas avoid degradation plaguing prose summaries (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2601.22436 (2026) — Self-Evolving agents' fidelity to raw experience vs. summaries
• arXiv:2605.28713 (2025) — Reasoning as compression; thinking traces as context selectors
• arXiv:2605.12978 (2025) — Continuous memory consolidation degradation curve
• arXiv:2510.21618 (2025) — DeepAgent's structured memory folding

Your task:
(1) RE-TEST EACH CONSTRAINT. For "naive summarization strips actionable detail," ask: do post-2026 scaling laws, routing-based memory (sparse retrieval vs. dense consolidation), or retrieval-augmented memory (fresh raw traces on demand) now *bypass* the need to preserve actionability in summaries themselves? Does reward-driven compression now reliably preserve directives, or do edge cases remain? Cite what resolves or still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Do newer papers show that evaluative summaries suffice when paired with online re-reasoning, or that directive signals can be reconstructed post-hoc?
(3) Propose 2 research questions that assume the regime may have shifted: (a) If sparse retrieval of raw traces replaces dense memory consolidation, does actionability become a retrieval ranking problem rather than a compression design problem? (b) Can multi-stage condensation—evaluative summary for routing, directive trace for action—decouple the two signals so each can optimize independently?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
