How do agents decide which created code deserves long-term persistence?
This note explores how agents decide which of the code they write during a task is scratch work and which deserves to be saved, shared, and promoted into durable infrastructure. The corpus suggests this 'lifecycle' decision is one of the least-settled problems in agent design.
The honest answer from the corpus is that nobody has nailed this yet. Of the three kinds of code in an agentic system (capability baked into the model, the harness handed to the agent, and the artifacts the agent writes for itself), the agent-initiated, persistent, shared kind is the **least studied layer**, with open problems clustering around lifecycle decisions, keeping shared state consistent, and promoting throwaway scripts into lasting tools (see: What makes agent-authored code worth persisting and sharing?). So the question lands on a live frontier rather than a solved recipe.
The deeper reason persistence matters is that code isn't only an output: it's an executable, inspectable, stateful medium the agent can run, read back, and verify against (see: Can code serve as the operational substrate for agent reasoning?). That reframes the keep/discard decision: code worth persisting is code that earns its place in the agent's working loop, not code that merely ran once. The clearest worked example is VOYAGER's skill library. It keeps a piece of code when the code becomes a reusable, composable skill, indexes it by embedding so it can be retrieved later, and refines it through environmental feedback. Crucially, only verified, working skills graduate into the library, which is also how the agent learns continuously without forgetting (see: Can agents learn new skills without forgetting old ones?).
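The verify-then-index pattern described above can be sketched in a few lines of Python. This is an illustrative sketch, not VOYAGER's actual implementation: `Skill`, `SkillLibrary`, `add_if_verified`, and `retrieve` are hypothetical names, and the bag-of-words `embed` function is a toy stand-in for a real embedding model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Skill:
    name: str          # hypothetical fields, chosen for illustration
    description: str   # text that gets embedded for retrieval
    source: str        # the code the agent wrote


@dataclass
class SkillLibrary:
    skills: list = field(default_factory=list)  # (embedding, Skill) pairs

    def add_if_verified(self, skill: Skill, verify: Callable[[dict], bool]) -> bool:
        """Persist the skill only if its code runs and passes the check."""
        namespace: dict = {}
        try:
            # A real system would sandbox this; exec is used only for brevity.
            exec(skill.source, namespace)
            ok = bool(verify(namespace))
        except Exception:
            ok = False
        if ok:
            self.skills.append((embed(skill.description), skill))
        return ok

    def retrieve(self, query: str, k: int = 1) -> list:
        """Return the k skills whose descriptions best match the query."""
        q = embed(query)
        ranked = sorted(self.skills, key=lambda pair: cosine(q, pair[0]), reverse=True)
        return [skill for _, skill in ranked[:k]]
```

Under these assumptions, a snippet that fails its check never enters the library, so retrieval only ever surfaces code that worked at least once. The check itself (here a caller-supplied `verify` function) plays the role that environmental feedback plays in the source's description.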
What's emerging is that 'should this persist?' may be too hard a judgment to leave to the agent that wrote the code. SkillOS splits the roles: a separate, *trained curator* decides what enters and evolves the repository while the executor stays frozen, and that decoupling shifts the library away from generic, verbose dumps toward genuinely actionable execution logic and cross-task meta-strategies (see: Can a separate trained curator improve skill libraries better than frozen agents?). The lesson worth carrying: curation is its own skill, and an agent optimized to *do* the task isn't automatically good at *deciding what to keep* from it. Verification feeds this too: semi-formal, execution-free reasoning can now judge whether two agent-written patches are equivalent at about 93% accuracy, which makes it a cheap candidate gate for what's worth promoting (see: Can structured reasoning replace code execution for RL rewards?).
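The role split can be made concrete with a small sketch in which the executor only proposes candidates and a separate curator owns the repository. Everything here is an assumption for illustration: `Candidate`, `Curator`, and `heuristic_score` are hypothetical names, and the hand-written scoring rule merely stands in for the trained curator described above, which the source does not specify at this level of detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A snippet the executor proposes; all fields are hypothetical."""
    name: str
    source: str
    tasks_reused_in: int   # distinct tasks in which the snippet was reused
    passed_checks: bool    # result of whatever verification gate is in place


def heuristic_score(c: Candidate) -> float:
    """Placeholder for a trained curator: reward reuse, penalize verbosity."""
    if not c.passed_checks:
        return 0.0
    lines = sum(1 for line in c.source.splitlines() if line.strip())
    return c.tasks_reused_in / (1 + lines / 20)


class Curator:
    """Owns the repository; the executor never writes to it directly."""

    def __init__(self, score: Callable[[Candidate], float], threshold: float):
        self.score = score
        self.threshold = threshold
        self.repository: Dict[str, Candidate] = {}

    def review(self, proposals: Iterable[Candidate]) -> List[str]:
        """Admit proposals that clear the threshold; return admitted names."""
        admitted = []
        for c in proposals:
            if self.score(c) >= self.threshold:
                self.repository[c.name] = c  # a newer version replaces an older one
                admitted.append(c.name)
        return admitted
```

The design point is the interface rather than the scoring rule: because `score` is injected, a learned model could replace `heuristic_score` without touching the executor, which mirrors the frozen-executor, trainable-curator split.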
Zoom out and persistence is really one instance of a bigger pattern: reliable agents work by externalizing their cognitive load into memory, skills, and protocols held in the harness, rather than re-deriving everything from scratch each run (see: Where does agent reliability actually come from?). Persisted code is the 'skills' slice of that externalization, and it sits alongside persisted *memory*: AgentFly shows agents can adapt continuously by writing to case, subtask, and tool memory with no weight updates at all (see: Can agents learn continuously from experience without updating weights?). Even governance rules persist best when written into the memory layer the agent actually consults mid-decision (see: Can governance rules embedded in runtime memory actually protect autonomous agents?).
The most surprising turn is economic. In long-lived agent environments, the right question isn't 'which code is good' but 'which code pays for itself,' because once context and artifacts persist and get reused, the meaningful cost unit flips from cost-per-token to cost-per-completed-artifact. One 115-day case study found 82.9% of tokens were cache reads (see: Do persistent agents really cost less per token?). And what predicts whether durable artifacts get built at all isn't initial brilliance but *persistence in the feedback loop*: repeatedly testing, editing, and re-incorporating, a loop most models quit early (see: What predicts success in ultra-long-horizon agent tasks?). So 'what deserves to persist' increasingly looks less like a one-shot quality verdict and more like what survives repeated reuse and pays for its own storage.
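The shift in cost unit is easy to see with a little arithmetic. The sketch below is illustrative only: `Usage`, `Prices`, and the helper functions are hypothetical names, and any prices or token counts passed in are the caller's assumptions, not figures from the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Token counts for one long-running agent deployment (hypothetical)."""
    fresh_input_tokens: int
    cache_read_tokens: int
    output_tokens: int


@dataclass(frozen=True)
class Prices:
    """Assumed prices in USD per million tokens; not real vendor rates."""
    fresh_input: float
    cache_read: float
    output: float


def total_cost(u: Usage, p: Prices) -> float:
    """Total spend, pricing cache reads separately from fresh input."""
    return (
        u.fresh_input_tokens * p.fresh_input
        + u.cache_read_tokens * p.cache_read
        + u.output_tokens * p.output
    ) / 1_000_000


def cache_read_share(u: Usage) -> float:
    """Fraction of all tokens that were served from cache."""
    total = u.fresh_input_tokens + u.cache_read_tokens + u.output_tokens
    return u.cache_read_tokens / total if total else 0.0


def cost_per_artifact(u: Usage, p: Prices, completed_artifacts: int) -> float:
    """The denominator the text argues for: completed artifacts, not tokens."""
    if completed_artifacts <= 0:
        raise ValueError("no completed artifacts to amortize cost over")
    return total_cost(u, p) / completed_artifacts
```

When cache reads dominate the token count and are priced well below fresh input, a raw cost-per-token figure says little about value delivered. Dividing total spend by the number of artifacts that were actually completed and reused gives a number that can be compared across agents and across time.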
Sources (10 notes)
Of three agentic code elements, agent-initiated artifacts that persist and are shared across agents remain underexplored. Open challenges cluster around lifecycle decisions, shared state consistency, and promotion from scratch work to durable infrastructure.
Research shows code uniquely enables agent reasoning, action, and verification by being simultaneously executable, inspectable, and stateful. This unified code-centered loop improves reasoning and verification together compared to natural-language or prose-based approaches.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.
Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
A persistent agent recorded 889 governance events across 96 active days, with safeguards encoded directly into the memory layer the agent consulted during operation. Runtime-resident governance proved more effective than external policies because the agent actually accessed it during decision-making.
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.
Across 17 frontier models on 36 expert-curated optimization tasks, repeated benchmark-edit-incorporate cycles within a wall-clock budget proved the dominant success predictor. Most models terminated early or burned budget unproductively; Claude Opus 4.6 stood out as persistent.