INQUIRING LINE

Generalizing a lesson makes it reusable — but how do you avoid accidentally discarding the conditions that make it true?

How should abstraction preserve applicability conditions when distilling experience?

This explores the central tension in turning past experience into reusable knowledge: abstraction works by stripping away specifics, but strip too much and you lose track of *when* a lesson actually applies — so how do agents generalize without forgetting the conditions that made the lesson true.

Abstraction works by throwing away specifics, but the conditions under which a lesson holds are themselves a kind of specific. Distill too aggressively and you keep the rule while losing the 'only when…' attached to it. The corpus circles this problem from several angles, and the most useful thread is that good abstraction is *selective*, not *maximal*: it discards example-specific values while deliberately retaining the structural context that signals applicability. Agent Workflow Memory is the cleanest illustration: it abstracts away example-specific values (this URL, that button) but preserves the sub-task routine as a unit, so the routine carries its own 'this is the shape of situation I belong to.' The gains grow as train-test gaps widen (see "Can agents learn reusable sub-task routines from past experience?"), which is the signature of an abstraction that generalized without over-generalizing.
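A minimal sketch of selective abstraction, under stated assumptions: the `Routine` class, the `abstract_trace` helper, the feature names, and the example URLs are all hypothetical and are not taken from Agent Workflow Memory itself. It only illustrates the idea of swapping concrete values for slots while keeping the structural preconditions attached to the routine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routine:
    """A distilled sub-task routine that carries its own applicability conditions."""
    name: str
    preconditions: frozenset  # structural features the situation must have
    steps: tuple              # steps with {slots} in place of concrete values

    def applies_to(self, situation_features):
        # Usable only when every precondition is present in the new situation.
        return self.preconditions <= set(situation_features)

    def instantiate(self, **values):
        # Re-bind the abstracted slots to the values of the new situation.
        return [step.format(**values) for step in self.steps]


def abstract_trace(name, trace, bindings, preconditions):
    """Replace example-specific values with named slots; keep order and conditions."""
    steps = []
    for step in trace:
        for slot, value in bindings.items():
            step = step.replace(value, "{" + slot + "}")
        steps.append(step)
    return Routine(name, frozenset(preconditions), tuple(steps))


# Hypothetical recorded trace from one successful episode.
trace = [
    "open https://shop.example/search",
    "type 'red shoes' into search box",
    "click Search",
]
routine = abstract_trace(
    "search_site",
    trace,
    bindings={"url": "https://shop.example/search", "query": "red shoes"},
    preconditions={"has_search_box", "has_submit_button"},
)
```

The values (URL, query) are gone from `routine.steps`, but `routine.applies_to(...)` still refuses a page that lacks a search box, which is the part a maximal abstraction would have dropped.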

The failure mode at the other end is worth naming, because it's what 'preserving applicability conditions' is defending against. Chain-of-thought, on one reading, is abstraction gone wrong: it reproduces the *form* of a reasoning pattern learned in training while losing the conditions under which that pattern is valid, so performance degrades predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?"). That's the diagnostic. An abstraction that has shed its applicability conditions looks fine until the situation drifts, then breaks silently. The same shape shows up in memory work as 'brevity bias' and 'context collapse': when you compress a playbook by rewriting it wholesale, you erase the hard-won detail that told you when each move was right. ACE's answer is to grow contexts through incremental generation-reflection-curation rather than full rewrites, treating the playbook as something you *append conditions to* rather than *summarize away* (see "Can context playbooks prevent knowledge loss during iteration?").
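The contrast between appending conditions and summarizing them away can be sketched as follows. This is an illustration only: the `Playbook` and `Entry` classes, the `apply_delta` and `lossy_rewrite` names, and the example rule are assumptions made for this sketch, not ACE's actual interface.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    rule: str
    when: list  # conditions under which the rule was observed to hold


class Playbook:
    def __init__(self):
        self.entries = {}

    def apply_delta(self, key, rule, condition):
        """Incremental update: add an entry, or append a condition to an existing one."""
        entry = self.entries.get(key)
        if entry is None:
            self.entries[key] = Entry(rule, [condition])
        elif condition not in entry.when:
            entry.when.append(condition)

    def lossy_rewrite(self):
        """Wholesale rewrite that keeps only the rules: the failure mode to avoid."""
        return [entry.rule for entry in self.entries.values()]


playbook = Playbook()
playbook.apply_delta("retry", "retry the request", "error is a timeout")
playbook.apply_delta("retry", "retry the request", "endpoint is idempotent")
```

After two deltas the entry still records both conditions, whereas `lossy_rewrite()` returns a shorter playbook that no longer says when retrying is safe.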

There's a structural counterpoint that reframes the whole question. LLM Programs deliberately hide step-irrelevant context, showing each call only what it needs (see "Can algorithms control LLM reasoning better than LLMs alone?"). That sounds like the opposite of preserving conditions, but it isn't. The applicability condition there lives *outside* the abstraction, encoded in the surrounding algorithm's control flow rather than inside the distilled step. This is a genuine design fork: do you bake the 'when' into the abstraction itself (AWM's self-describing routines), or do you keep abstractions context-free and let an external scaffold decide when to invoke them? DeepAgent's memory folding splits the difference, consolidating history into typed schemas (episodic, working, tool) where the schema type itself is a coarse applicability tag (see "Can agents compress their own memory without losing critical details?").
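The external-scaffold side of that fork might look like the sketch below. Everything here is hypothetical: the `Step` tuple, `run_program`, and the two toy steps stand in for LLM calls and are not drawn from the LLM Programs paper. The point is only that the 'when' sits in the scaffold and each step sees just the state it needs.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    name: str
    when: Callable[[dict], bool]  # applicability condition, owned by the scaffold
    needs: tuple                  # the only state keys the step is shown
    run: Callable[..., str]       # context-free step (stand-in for an LLM call)


def run_program(steps, state):
    """Walk the steps in order; invoke a step only if its condition holds."""
    log = []
    for step in steps:
        if not step.when(state):
            continue
        # Information hiding: pass only the declared keys, nothing else.
        visible = {key: state[key] for key in step.needs}
        state[step.name] = step.run(**visible)
        log.append(step.name)
    return log


steps = [
    Step("shorten", lambda s: len(s["document"]) > 40, ("document",),
         lambda document: document[:40]),
    Step("answer", lambda s: "question" in s, ("document", "question"),
         lambda document, question: f"{question} -> {len(document)} chars"),
]

state = {"document": "x" * 50, "question": "How long?", "secret": "hidden"}
log = run_program(steps, state)
```

Neither step knows why it was called, and neither ever sees `secret`; removing the `when` field would leave two perfectly reusable steps with no record of when to use them.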

The deeper reason this matters connects to a separate finding the reader might not expect to be relevant: RL post-training seems to teach models *when* to deploy reasoning, not *how*. The capability pre-exists, and what's learned is the routing (see "Does RL post-training create reasoning or just deploy it?"). If that's right, then 'applicability conditions' aren't a side-constraint on distilled experience; they're the *main thing being learned*. The abstraction (the reasoning move) was already latent; the valuable distillate is the trigger. AgentFly makes this operational, doing all of its continual learning through memory operations that handle credit assignment, that is, learning which stored case applies to which new situation, without touching model weights (see "Can agents learn continuously from experience without updating weights?").
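As a rough illustration of learning 'which stored case applies' purely through memory operations, here is a toy case memory. It is a sketch under loud assumptions: the `CaseMemory` class, the Jaccard-overlap scoring, the utility update rule, and the feature names are all invented for this example and do not reproduce AgentFly's actual method.

```python
class CaseMemory:
    """Toy case bank: learning happens by writing and re-scoring cases, not weights."""

    def __init__(self):
        self.cases = []

    def write(self, features, action, reward):
        # Store the situation's features alongside the action and its outcome.
        self.cases.append(
            {"features": frozenset(features), "action": action, "utility": reward}
        )

    def retrieve(self, features):
        # Rank cases by feature overlap (Jaccard) weighted by learned utility.
        features = set(features)

        def score(case):
            union = case["features"] | features
            overlap = len(case["features"] & features) / len(union) if union else 0.0
            return overlap * case["utility"]

        return max(self.cases, key=score, default=None)

    def credit(self, case, reward, lr=0.5):
        # Credit assignment: move the case's utility toward the observed reward.
        case["utility"] += lr * (reward - case["utility"])


memory = CaseMemory()
memory.write({"login_form", "captcha"}, "ask_human", 1.0)
memory.write({"login_form"}, "fill_and_submit", 1.0)
best = memory.retrieve({"login_form"})
```

The stored features act as the applicability condition, and `credit` adjusts how strongly a case is trusted when its conditions match; the model parameters never change.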

The takeaway a curious reader can leave with: 'preserve applicability conditions' is not a footnote to abstraction — it may be the harder and more valuable half. An abstraction without its conditions is just a pattern waiting to misfire on the next distribution shift. The corpus suggests three viable disciplines — keep conditions inside the unit (self-describing routines), keep them in an external scaffold (program control flow), or keep them in a typed memory index (folded schemas) — but all three agree on the negative: never compress the 'when' away to make the 'what' shorter.

Sources 7 notes

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies pre-exist before any RL.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Useful Memories Become Faulty When Continuously Updated by LLMs (2.61 match)
Rethinking Memory as Continuously Evolving Connectivity (1.72 match)
Are We Ready For An Agent-Native Memory System? (1.71 match)
Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments (1.70 match)
A Survey of Context Engineering for Large Language Models (1.66 match)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (0.93 match)
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs (0.90 match)
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory (0.90 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-testing claims about abstraction and applicability conditions in LLM agents. The question remains open: *How should distilled experience preserve the conditions under which a lesson holds, without collapsing into over-specificity or losing generality?*

What a curated library found — and when (findings span 2023–2026; treat as dated claims, not current truth):
• Agent Workflow Memory preserves applicability by keeping sub-task routines as self-describing units while discarding example-specific values; gains compound as train-test gaps widen (2024-09).
• Chain-of-thought reproduces reasoning *form* while losing the conditions for validity — performance collapses under distribution shift; this is 'abstraction gone wrong' (2025-06).
• Context engineering treats playbooks as *incrementally appended* rather than wholly rewritten, preventing 'brevity bias' and 'context collapse' (2025-10).
• LLM Programs hide step-irrelevant context; applicability conditions live in external algorithm scaffolding, not inside the distilled abstraction (2025-10).
• RL post-training teaches models *when* to deploy reasoning, not *how*; the valuable distillate is the trigger/routing, not the capability (2025-05).

Anchor papers (verify; mind their dates):
• arXiv:2409.07429 (Agent Workflow Memory, 2024-09)
• arXiv:2506.02878 (CoT Is Not True Reasoning, 2025-06)
• arXiv:2510.21618 (DeepAgent, 2025-10)
• arXiv:2605.12978 (Useful Memories Become Faulty, 2026-05)

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, ask: Have newer reasoning models, improved memory indexing, or better orchestration (multi-agent routing, retrieval-augmented agents, dynamic routing) since made any of these limits porous? Concretely: does o1 or stronger post-training change whether CoT *must* collapse under shift? Do typed memory indexes now outperform self-describing routines, or vice versa? Separate the durable question (likely: how to encode 'when' without redundancy) from the perishable limit (possibly: current CoT cannot learn routing). Name what resolved it.
(2) Surface the strongest work from the last 6 months that *contradicts* or *supersedes* the library's tension — e.g., a paper claiming abstraction *should* be maximal, or that conditions live nowhere but in examples themselves, or that continual memory update *does* preserve conditions if done right.
(3) Propose 2 research questions that assume the regime *has* moved: (a) If routing/timing is now the main learned signal, what is the abstraction *of* the routing itself — does it generalize across agent types? (b) If external scaffolds outpace internal descriptions, what makes a scaffold condition-preserving rather than condition-hiding?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
