INQUIRING LINE


Can one agent's recipe for deciding what to remember and forget safely transfer to a different agent of equal skill?

Can context management policies transfer across agents of similar capability levels?

This explores whether a learned recipe for what to keep, prune, or compress in an agent's working context is portable — can you train a context policy once and reuse it on a different agent, and does that reuse hold when the two agents are roughly equally capable?

This explores whether a context-management policy (the rules for what to keep, prune, or compress in an agent's working memory) can move from one agent to another, and whether matching capability is what makes that move safe. The corpus suggests the answer is yes, but conditionally: context policy is coupled to an agent's reliability, so it transfers cleanly between agents of similar capability and breaks when capability diverges. The most direct evidence comes from work on offloading context to a trained external manager (see "Can external managers compress context better than frozen agents?"), where the manager compresses context optimally by *matching the agent's reliability*: stronger agents get high-fidelity preservation, while weaker agents need aggressive pruning to stay coherent. The compression policy isn't a universal constant; it's tuned to the consumer. That's precisely why "similar capability" is the load-bearing condition in your question: the policy is portable exactly across the band where the optimal compression aggressiveness is the same.
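One way to picture "compression tuned to the consumer" is a pruning function whose aggressiveness is a parameter of the agent rather than a constant. The sketch below is illustrative only and is not the trained manager described in the note: `ContextItem`, `relevance`, `keep_fraction`, and the linear mapping with a `floor` of 0.2 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ContextItem:
    text: str
    relevance: float  # hypothetical score in [0, 1]; higher means more useful


def keep_fraction(reliability: float, floor: float = 0.2) -> float:
    """Map agent reliability in [0, 1] to the fraction of context to keep.

    Assumed linear mapping: reliable agents keep nearly everything,
    fragile agents are pruned down toward `floor`.
    """
    r = min(max(reliability, 0.0), 1.0)  # clamp out-of-range inputs
    return floor + (1.0 - floor) * r


def compress(items: List[ContextItem], reliability: float) -> List[ContextItem]:
    """Keep the most relevant items, preserving their original order."""
    n_keep = max(1, round(len(items) * keep_fraction(reliability)))
    top = sorted(range(len(items)),
                 key=lambda i: items[i].relevance,
                 reverse=True)[:n_keep]
    return [items[i] for i in sorted(top)]
```

Under these assumptions, two agents with the same reliability value receive identical pruning, which is the band in which the policy is portable; as the values diverge, the same history is cut very differently.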

What makes this transferable at all is that good context management lives *outside* the model. Reliable agents externalize their memory, skills, and protocols into a harness layer rather than baking them into weights (see "Where does agent reliability actually come from?"). Because the policy is a separable component, it can in principle be lifted off one agent and dropped onto another: you're not retraining a model, you're reattaching a module. The same logic shows up in memory-folding schemes where agents compress their own interaction history into structured schemas (see "Can agents compress their own memory without losing critical details?"), and in memory-augmented learning where adaptation happens entirely through memory operations without touching parameters (see "Can agents learn continuously from experience without updating weights?"). When the policy is structural rather than parametric, transfer becomes a packaging question instead of a training question.
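The "reattaching a module" idea can be sketched as a harness that takes its context policy as a plug-in. Everything here is hypothetical scaffolding for illustration: `ContextPolicy`, `KeepRecent`, and `Harness` are invented names, and the "model" is any callable from a context list to a reply string, not a real agent API.

```python
from typing import Callable, List, Protocol


class ContextPolicy(Protocol):
    """Hypothetical interface: choose which history entries the model sees."""
    def select(self, history: List[str], budget: int) -> List[str]: ...


class KeepRecent:
    """Simplest possible policy: keep only the last `budget` entries."""
    def select(self, history: List[str], budget: int) -> List[str]:
        return history[-budget:] if budget > 0 else []


class Harness:
    """Wraps a frozen model; the policy lives here, outside the weights."""
    def __init__(self, model: Callable[[List[str]], str],
                 policy: ContextPolicy, budget: int) -> None:
        self.model = model
        self.policy = policy
        self.budget = budget
        self.history: List[str] = []

    def step(self, observation: str) -> str:
        self.history.append(observation)
        context = self.policy.select(self.history, self.budget)
        reply = self.model(context)
        self.history.append(reply)
        return reply


# The same policy object is attached to two different stand-in models.
shared_policy = KeepRecent()
agent_a = Harness(lambda ctx: f"a:{len(ctx)}", shared_policy, budget=2)
agent_b = Harness(lambda ctx: "b:" + ctx[-1], shared_policy, budget=2)
```

Because the policy never touches model parameters, moving it is just passing the same object to another harness. Whether it is safe to do so still depends on the capability match discussed in the surrounding text.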

The catch is the failure mode on the other side of the capability gap. A high-fidelity policy that keeps a strong agent reliable will overload a weaker one with more context than it can use coherently; conversely, an aggressive policy tuned for a fragile agent throws away context a capable agent could have used. This is why heterogeneous-capability designs treat the question carefully: the case for running small models on most subtasks (see "Can small language models handle most agent tasks?") assumes you're matching the workload (and implicitly the context budget) to the model's competence, not handing a small model a policy built for a large one. Capability isn't just a routing signal here; it's the variable the policy is fit against.

There's a subtler reason transfer might be less fragile than it sounds. If most multi-agent performance variance comes down to token budget rather than coordination cleverness (see "How does test-time scaling work at the agent level?"), then a context policy is largely a token-allocation policy, and token economics generalize across agents of comparable size far more readily than task-specific behavior does. That reframes the whole question: you may not need a policy that understands the agent, just one calibrated to the same compute envelope. The thing that doesn't transfer cleanly is coordination context across many agents: distributed systems degrade predictably as they accept neighbor information uncritically (see "Why do multi-agent systems fail to coordinate at scale?"), which is a different problem from single-agent context fidelity.
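Read as a token-allocation rule, a context policy reduces to splitting a fixed budget across parts of the context. The following is a minimal sketch under that framing; `allocate_tokens`, the section names, and the weights are hypothetical, and largest-remainder rounding is simply one reasonable way to make the integer shares sum to the budget.

```python
from typing import Dict


def allocate_tokens(budget: int, weights: Dict[str, float]) -> Dict[str, int]:
    """Split a token budget across context sections in proportion to weights.

    Uses largest-remainder rounding so the shares always sum to `budget`.
    """
    total = sum(weights.values())
    if budget < 0 or total <= 0:
        raise ValueError("budget must be >= 0 and weights must sum to > 0")
    raw = {name: budget * w / total for name, w in weights.items()}
    alloc = {name: int(share) for name, share in raw.items()}
    leftover = budget - sum(alloc.values())
    # Hand the remaining tokens to the sections with the largest fractions.
    by_fraction = sorted(raw, key=lambda n: raw[n] - alloc[n], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc
```

Nothing in this function refers to the agent itself, only to the budget. That is the sense in which a policy calibrated to one compute envelope might carry over to another agent operating in the same envelope.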

The useful surprise here: "capability level" turns out to be a proxy for *reliability under compression*, and that's the real axis a context policy is fit to. So the honest version of your question is less "do these two agents score similarly on benchmarks" and more "do they degrade the same way when you take context away from them." Two agents can share a capability tier and still tolerate compression differently. Capability-vector approaches that make an agent's competence a first-class, versioned, queryable object (see "Can semantic capability vectors replace manual agent routing?") hint at the right substrate for deciding when a policy is safe to reuse: match on the compression-tolerance profile, not the headline capability number.
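Matching on a compression-tolerance profile rather than a headline score could look like the sketch below. It assumes each agent has been evaluated at a few context keep-fractions; the `Profile` type, `profile_gap`, `safe_to_reuse`, the 0.05 tolerance, and every number in the example are invented for illustration and are not measurements from any of the cited work.

```python
from typing import Dict

# Hypothetical profile: fraction of context kept -> task score at that level.
Profile = Dict[float, float]


def profile_gap(a: Profile, b: Profile) -> float:
    """Largest score difference over the compression levels both agents share."""
    shared = sorted(set(a) & set(b))
    if not shared:
        raise ValueError("profiles share no compression levels")
    return max(abs(a[k] - b[k]) for k in shared)


def safe_to_reuse(a: Profile, b: Profile, tolerance: float = 0.05) -> bool:
    """Treat a policy as reusable only if the agents degrade alike."""
    return profile_gap(a, b) <= tolerance


# Made-up numbers: same score with full context, different degradation.
source_agent = {1.0: 0.80, 0.5: 0.78, 0.25: 0.75}
same_tier_fragile = {1.0: 0.80, 0.5: 0.70, 0.25: 0.50}
same_tier_similar = {1.0: 0.79, 0.5: 0.77, 0.25: 0.74}
```

In this toy data `same_tier_fragile` matches `source_agent` on the uncompressed score yet fails the check, while `same_tier_similar` passes. That is the distinction between sharing a capability tier and sharing a tolerance profile.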

Sources (8 notes)

Can external managers compress context better than frozen agents?

An external RL-trained manager can adaptively prune context for frozen agents, with the key insight that stronger agents benefit from high-fidelity preservation while weaker agents need aggressive compression to stay reliable.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Can agents learn continuously from experience without updating weights?

AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.


How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Why do multi-agent systems fail to coordinate at scale?

AgentsNet benchmark shows agents fail to coordinate strategies either by agreeing too late or adopting strategies without informing neighbors. Agents accept neighbor information without verification, enabling error propagation while remaining capable of detecting direct conflicts.

Can semantic capability vectors replace manual agent routing?

Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Towards a Science of Scaling Agent Systems · 4.23 match · arXiv
From Model Scaling to System Scaling: Scaling the Harness in Agentic AI · 3.27 match · arXiv
Useful Memories Become Faulty When Continuously Updated by LLMs · 2.63 match · arXiv
Scaling Behavior of Single LLM-Driven Multi-Agent Systems · 2.54 match · arXiv
LLMs Corrupt Your Documents When You Delegate · 2.49 match · arXiv
Rethinking Memory as Continuously Evolving Connectivity · 1.72 match · arXiv
Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents · 1.72 match · arXiv
Are We Ready For An Agent-Native Memory System? · 1.71 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about context-management policy transfer across agents. The question: Can context policies move safely between agents of similar capability levels, and what makes that transfer work or break?

What a curated library found — and when (findings span 2024–2026; treat as dated claims):
• Context-management policies are *separable components* that live outside model weights, in harness/memory layers, enabling transfer as a module-reattachment problem rather than retraining (2026-04).
• A policy's effectiveness is *fit to the agent's compression tolerance*, not just headline capability; stronger agents benefit from high-fidelity preservation, weaker agents need aggressive pruning for coherence (~2025–2026).
• Context policy transfer fails across capability divergence: a high-fidelity policy tuned for a strong agent overloads a weak one, and an aggressive policy tuned for a weak agent discards context a strong one could use; matching capability tier is the load-bearing condition for safety (2025–2026).
• Token budget often explains multi-agent performance variance better than coordination cleverness, suggesting context policies (as token-allocation rules) generalize across comparable-sized agents more readily than task-specific behavior (2025).
• Continuous memory updates degrade reliability: "useful memories become faulty when continuously updated by LLMs" (2026-05), implying fixed, verified context policies may transfer better than adaptive ones.

Anchor papers (verify; mind their dates):
- arXiv:2604.08224 (2026-04) — Externalization in LLM Agents: Memory, Skills, Protocols, and Harness
- arXiv:2605.30785 (2026-05) — Learning Agent-Compatible Context Management for Long-Horizon Tasks
- arXiv:2506.02153 (2025-06) — Small Language Models are the Future of Agentic AI
- arXiv:2509.20175 (2025-09) — Federation of Agents: Semantics-Aware Communication Fabric

Your task:
(1) RE-TEST EACH CONSTRAINT. For each claim above, judge whether newer models, training methods, tooling (memory SDKs, harness frameworks), or evaluation harnesses have relaxed or overturned it since early 2026. Separate the durable question ("does policy *structure* transfer?") from the perishable limitation ("does it work across broad capability gaps?"). Cite what resolved each constraint, or state plainly where it still holds.
(2) Surface the strongest work from the last ~6 months that contradicts the "similar capability ⟹ safe transfer" thesis—e.g., evidence that policy transfer succeeds *despite* capability mismatch, or fails *despite* matching tiers.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do versioned capability vectors + compression-tolerance profiling now replace capability-tier matching?" or "Has continuous learning in harness layers made fixed context policies obsolete?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
