INQUIRING LINE

How do agents decide which created code should persist versus disappear?

This explores how autonomous agents manage the lifecycle of code they write themselves — what gets kept, reused, and shared versus discarded — which the corpus frames as memory management, skill curation, and an underexplored 'persistence' problem.


The corpus treats the question of which agent-generated code should survive across tasks, and which should be thrown away, less as a coding question and more as a *memory and curation* question. The starting point is that agent-authored code that persists and gets shared is the least-explored of the three agentic code layers (see: What makes agent-created code artifacts so hard to manage?). It matters because code is not just an output to be regenerated on demand: it is an executable, inspectable, stateful medium the agent reasons through (see: Can code become the operational substrate for agent reasoning?). Once you see code as a substrate rather than a deliverable, 'should this persist?' becomes a real decision with consequences.

The clearest mechanism for keeping code is the skill library. VOYAGER stores working code as named, reusable skills in an embedding-indexed library and composes complex skills out of simpler ones, which lets it keep learning without the forgetting that weight-updating methods suffer (see: Can agents learn new skills without forgetting old ones?). So the first answer to 'what persists' is: code that *worked*, as validated by environmental feedback, gets promoted into the library; everything else is scratch. But who does the promoting matters. SkillOS shows that handing curation to a *separately trained curator* (decoupled from the frozen executor that writes the code) shifts the library away from verbose generic additions toward sharp, actionable execution logic and cross-task meta-strategies (see: Can a separate trained curator improve skill libraries better than frozen agents?). In other words, the keep/discard decision improves when it is treated as a learned skill in its own right rather than a side effect of generation.
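The promote-on-validation idea can be sketched in a few lines. This is a minimal illustration, not VOYAGER's or SkillOS's implementation: the `SkillLibrary` class, the toy bag-of-words `embed` function, and the `runs_cleanly` validator are all hypothetical names introduced here, and a real system would use a learned embedding model and sandboxed environment feedback instead.

```python
import math
from dataclasses import dataclass, field
from typing import Callable

Vector = dict[str, float]


def embed(text: str) -> Vector:
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    vec: Vector = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def runs_cleanly(code: str) -> bool:
    """Hypothetical validator: the code must at least execute without error.
    A real agent would rely on task or environment feedback, inside a sandbox."""
    try:
        exec(code, {})
        return True
    except Exception:
        return False


@dataclass
class SkillLibrary:
    # name -> (source code, embedding of the skill description)
    skills: dict[str, tuple[str, Vector]] = field(default_factory=dict)

    def promote_if_valid(self, name: str, description: str, code: str,
                         validate: Callable[[str], bool]) -> bool:
        """Persist the code only if validation passes; otherwise it stays scratch."""
        if not validate(code):
            return False
        self.skills[name] = (code, embed(description))
        return True

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        """Return the names of the k skills whose descriptions best match the query."""
        q = embed(query)
        ranked = sorted(self.skills,
                        key=lambda n: cosine(q, self.skills[n][1]),
                        reverse=True)
        return ranked[:k]


library = SkillLibrary()
kept = library.promote_if_valid(
    "mine_wood", "collect wood logs from a tree",
    "def mine_wood():\n    return 'wood'", runs_cleanly)
dropped = library.promote_if_valid(
    "broken", "craft a stone pickaxe",
    "def broken(:\n    pass", runs_cleanly)
best = library.retrieve("collect wood")
```

The design choice worth noticing is that the validator is passed in rather than hard-coded: swapping `runs_cleanly` for a learned curator is the kind of change the SkillOS note describes, and it leaves the storage and retrieval code untouched.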

Laterally, this is the same problem the memory-management literature is wrestling with under different vocabulary. One framing splits the decision into two paths: an explicit 'hot path' where the agent itself decides via tool calls what to store or delete, and an implicit background path triggered programmatically. The two trade context-sensitivity against reliability (see: How should agents decide what memories to keep?). DeepAgent pushes the autonomous side further with 'memory folding,' compressing past interactions into structured episodic, working, and tool schemas so the agent keeps what's strategically useful and sheds token overhead (see: Can agents compress their own memory without losing critical details?). RAISE adds a useful nuance: memory (and by extension artifacts) decomposes by time scale and granularity, which predicts that different kinds of code should follow different retention policies rather than one global rule (see: How should agent memory split across time scales?).
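A rough sketch of the two-path split, under stated assumptions: the `MemoryStore` class, its method names, and the idle-time eviction rule are hypothetical illustrations, not an API from any of the cited systems. The hot-path methods are what an agent would call as tools; the background sweep runs on a programmatic trigger with no agent judgment involved.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    # key -> (stored value, timestamp of last write)
    items: dict[str, tuple[str, float]] = field(default_factory=dict)

    # Hot path: the agent decides, via tool calls, what to keep or drop.
    def tool_store(self, key: str, value: str, now: float) -> None:
        self.items[key] = (value, now)

    def tool_delete(self, key: str) -> bool:
        """Return True if something was actually deleted."""
        return self.items.pop(key, None) is not None

    # Background path: a programmatic trigger applies a fixed rule.
    def background_sweep(self, now: float, max_idle_seconds: float) -> list[str]:
        """Evict entries idle longer than the threshold (an assumed policy)."""
        stale = [key for key, (_, last) in self.items.items()
                 if now - last > max_idle_seconds]
        for key in stale:
            del self.items[key]
        return stale


store = MemoryStore()
store.tool_store("parse_csv_helper", "def parse(...): ...", now=0.0)
store.tool_store("retry_wrapper", "def retry(...): ...", now=90.0)

evicted = store.background_sweep(now=100.0, max_idle_seconds=50.0)
deleted_once = store.tool_delete("retry_wrapper")
deleted_twice = store.tool_delete("retry_wrapper")
```

The hot path can use whatever context the agent has at decision time but only fires if the agent remembers to call it; the sweep always fires but knows nothing about why an entry was stored.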

The quietly surprising thread is *economic*. In a 115-day persistent-agent study, 82.9% of tokens were cache reads, which flips the accounting: when context and code persist and get reused, the meaningful cost unit stops being the token and becomes the completed artifact (see: Do persistent agents really cost less per token?). That reframes the persist-versus-disappear decision: keeping code is not just a capability play, it is central to the economics of long-running agents. Persistence cuts the other way too: the same long-lived environment logged 889 governance events across 96 active days, with safeguards built directly into the memory layer the agent consults while deciding, so what persists is not only skills but also the rules constraining what is allowed to persist (see: Can governance rules embedded in runtime memory actually protect autonomous agents?).
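To make the change of denominator concrete, here is a small worked calculation. Only the 82.9% cache-read share comes from the study; the token total, both prices, and the artifact count are hypothetical placeholders chosen for round arithmetic, so the resulting figures illustrate the shape of the accounting rather than any reported cost.

```python
def blended_cost(total_tokens: float, cache_read_share: float,
                 price_uncached: float, price_cached: float) -> float:
    """Total spend when a share of tokens is billed at a cheaper cache-read rate.
    Prices are per million tokens."""
    cached = total_tokens * cache_read_share
    uncached = total_tokens - cached
    return (cached * price_cached + uncached * price_uncached) / 1_000_000


def cost_per_artifact(total_cost: float, artifacts: int) -> float:
    """The denominator the study argues for: completed artifacts, not tokens."""
    return total_cost / artifacts


CACHE_READ_SHARE = 0.829       # from the 115-day case study
TOTAL_TOKENS = 1_000_000_000   # hypothetical
PRICE_UNCACHED = 3.0           # hypothetical, per million tokens
PRICE_CACHED = 0.3             # hypothetical, per million tokens
ARTIFACTS = 100                # hypothetical count of completed artifacts

with_cache = blended_cost(TOTAL_TOKENS, CACHE_READ_SHARE,
                          PRICE_UNCACHED, PRICE_CACHED)
no_cache = blended_cost(TOTAL_TOKENS, 0.0, PRICE_UNCACHED, PRICE_CACHED)
per_artifact = cost_per_artifact(with_cache, ARTIFACTS)

print(f"with cache reads: {with_cache:.2f}")
print(f"without caching:  {no_cache:.2f}")
print(f"per artifact:     {per_artifact:.3f}")
```

Under these assumed prices, a flat per-token rate would overstate spend several times over, which is why dividing by completed artifacts gives a more stable comparison between persistent and stateless setups.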

One thing the corpus is honest about: the deletion and lifecycle side is under-researched. There are good accounts of *promotion* (skills, folding, curators), but the open challenges still cluster around persistence, sharing, and lifecycle, which is where the next gains in autonomy and coordination may come from (see: What makes agent-created code artifacts so hard to manage?). For a doorway into the trade-off itself, start with the two-path memory split and then read SkillOS to see what changes when curation becomes a trained skill rather than an afterthought.


Sources 9 notes

What makes agent-created code artifacts so hard to manage?

Of the three agentic code layers, agent-authored artifacts that persist and are shared across agents are underexplored in research. Open challenges cluster around persistence, sharing, and lifecycle management — exactly where future gains in autonomy and coordination may live.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

How should agents decide what memories to keep?

Memory management decomposes into explicit hot-path (agent decides via tool calling) and implicit background (programmatically triggered) paths. Each approach trades context-sensitivity for reliability differently across generation, storage, retrieval, and deletion.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies; autonomy and structure together avoid the degradation that plagues poorly designed consolidation.

How should agent memory split across time scales?

RAISE shows that agent memory consists of four components split across two levels of granularity: dialogue-level (conversation history, scratchpad) and turn-level (examples, task trajectory). This granularity distinction predicts different failure modes and update policies for each component.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Can governance rules embedded in runtime memory actually protect autonomous agents?

A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tracking agent memory, code persistence, and curation mechanisms. The core question: *what determines whether agent-generated code survives across tasks or gets discarded, and how does that decision reshape agent architecture?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2025–05 through 2026–05. The library identified:
- Skill libraries (e.g., VOYAGER-style) persist code validated by environmental feedback; RL-trained curators (decoupled from executors) shift libraries from verbose to sharp, actionable logic (SkillOS, 2026–05).
- Agent memory splits into explicit 'hot path' (agent-initiated tool calls) and implicit background paths (programmatic triggers); autonomous memory folding compresses interactions into episodic/working/tool schemas (DeepAgent, 2026–10).
- Memory decomposes by time scale and granularity, predicting different retention policies for different code types (RAISE framework, circa 2026).
- In a 115-day persistent-agent study: 82.9% of tokens were cache reads; the economic unit shifted from cost-per-token to cost-per-completed-artifact (2026–05).
- Governance rules embedded in the memory layer the agent consults constrain what persists; deletion and lifecycle management remain under-researched.

Anchor papers (verify; mind their dates):
- arXiv:2606.06614 (SkillOS: Learning Skill Curation for Self-Evolving Agents, 2026–05)
- arXiv:2610.21618 (DeepAgent: A General Reasoning Agent with Scalable Toolsets, 2026–10)
- arXiv:2605.26870 (Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study, 2026–05)
- arXiv:2604.08377 (SkillClaw: Let Skills Evolve Collectively with Agentic Evolver, 2026–04)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above — skill promotion, curator decoupling, dual-path memory, memory folding, time-scale decomposition, cache economics, embedded governance — assess whether newer model scaling, improved training regimes for curators, multi-agent orchestration (where code artifacts move between agents), or formal verification tools have since relaxed or overturned the limitation. Separate the durable question (how do agents *reason about persistence*?) from perishable claims (e.g., 'curators must be separately trained' — can end-to-end training now do it?). Cite what resolved it.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Does any recent paper argue that persistence is *harmful* (e.g., code drift, stale assumptions) or that stateless re-generation is now cheaper/safer?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., 'If multi-agent code-sharing becomes standard, how do agents vet code written by peers?' or 'Can a single learned pruning policy replace hand-tuned time-scale decomposition?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.
