SYNTHESIS NOTE

Can a separate trained curator improve skill libraries better than frozen agents?

Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.

Synthesis note · 2026-05-18 · sourced from Agents Multi Architecture

Reusable skills distilled from experience provide a natural substrate for self-evolving agents. The bottleneck is not whether to maintain a skill library but who curates it well. Manual curation demands expertise that does not scale to task diversity. Heuristic/prompting-based curation lacks downstream feedback. Existing RL approaches train short-horizon skill operations and miss the long-term curation policy needed for skill update and deletion.

SkillOS (2605.06614) makes two architectural decisions that together produce a surprising result. First, it decouples the trainable skill curator from the agent executor: the executor stays frozen and only retrieves and applies skills, while a separate trainable curator updates the SkillRepo from accumulated experience. This makes the curator a modular component that can be optimized without retraining the underlying agent.
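The split described above can be pictured with a minimal sketch. This is not SkillOS's code: the class names (`SkillRepo`, `FrozenExecutor`, `Curator`), the keyword-based retrieval, and the hand-written curation rule are all hypothetical stand-ins chosen for illustration. The only point it tries to show is the access pattern: the executor reads the repository, and the curator is the sole writer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillRepo:
    """Shared skill store mapping a skill name to its text (hypothetical layout)."""
    skills: Dict[str, str] = field(default_factory=dict)

    def retrieve(self, task: str) -> List[str]:
        # Toy retrieval: a skill matches when its name occurs in the task text.
        return [text for name, text in self.skills.items() if name in task]


@dataclass(frozen=True)
class Trajectory:
    task: str
    used_skills: Tuple[str, ...]
    success: bool


class FrozenExecutor:
    """Reads from the repo, never writes to it, and is never trained."""

    def __init__(self, solve: Callable[[str, List[str]], bool]):
        self._solve = solve

    def run(self, task: str, repo: SkillRepo) -> Trajectory:
        skills = repo.retrieve(task)
        return Trajectory(task, tuple(skills), self._solve(task, skills))


class Curator:
    """The only component that modifies the repo.

    Assumption: a fixed rule stands in for what would be a learned policy.
    """

    def update(self, repo: SkillRepo, history: List[Trajectory]) -> None:
        for traj in history:
            if not traj.success and not traj.used_skills:
                # Placeholder rule: add a stub skill keyed by the task's first word.
                key = traj.task.split()[0]
                repo.skills.setdefault(
                    key, f"For '{key}' tasks, verify the output before finishing."
                )


# Demo: a stand-in executor that succeeds only when some skill was retrieved.
repo = SkillRepo()
executor = FrozenExecutor(solve=lambda task, skills: len(skills) > 0)
curator = Curator()

first = executor.run("spreadsheet sum column", repo)       # no skills yet
curator.update(repo, [first])                              # curator writes a skill
second = executor.run("spreadsheet average column", repo)  # related task reuses it
```

Because the executor only sees the repository through `retrieve`, swapping in a different executor or a different curator leaves the other side untouched, which is the modularity the paragraph above describes.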

Second, group related tasks into training streams to provide long-horizon learning signals. Earlier trajectories update the SkillRepo; later related tasks evaluate those updates. The grouping exploits skill-relevant task dependencies — what was learned on one task is tested on adjacent tasks. Composite rewards combine downstream executor feedback with intermediate signals to attribute outcomes to specific curation decisions.
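As a rough illustration of the stream idea, the sketch below splits a stream of related tasks into an update portion and an evaluation portion, then mixes the downstream success rate with an intermediate signal. The function names, the `alpha` weight, and the linear mixing form are assumptions made for this example; the note does not specify how SkillOS actually combines its reward terms.

```python
from typing import List, Sequence, Tuple


def split_stream(stream: Sequence[str], n_update: int) -> Tuple[List[str], List[str]]:
    """Earlier tasks drive repo updates; later related tasks evaluate them."""
    if not 0 < n_update < len(stream):
        raise ValueError("need at least one update task and one evaluation task")
    return list(stream[:n_update]), list(stream[n_update:])


def composite_reward(
    downstream_successes: Sequence[bool],
    intermediate_score: float,
    alpha: float = 0.8,
) -> float:
    """Weighted mix of downstream success rate and an intermediate signal in [0, 1].

    Assumption: a simple convex combination; the real reward may differ.
    """
    if not downstream_successes:
        raise ValueError("no evaluation tasks")
    success_rate = sum(downstream_successes) / len(downstream_successes)
    return alpha * success_rate + (1.0 - alpha) * intermediate_score


# Demo: four related tasks, the first two used for curation, the last two for scoring.
update_tasks, eval_tasks = split_stream(
    ["parse csv", "sum column", "average column", "filter rows"], n_update=2
)
reward = composite_reward([True, False], intermediate_score=1.0, alpha=0.5)
```

The design point is that the reward for a curation decision is computed only from tasks that come after it in the same stream, so the curator is credited for updates that help later, related work rather than for the task it just observed.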

The surprising result is what the skill repository evolves into. Early in training, the curator introduces generic sections — additional guidance, tips, recommendations — that make skills more verbose without operational improvement. As training progresses, the additions shift toward actionable structures: failure-handling logic, conditional branches specifying when to deviate from defaults. Even more notably, the global organization evolves: early repositories contain narrow task-specific skills, later repositories contain meta-strategy skills covering verification, fallback planning, system search, and strategy adjustment. The curator does not merely accumulate skills — it progressively expands the repository's strategic space toward compositional cross-task control knowledge.

The most consequential downstream finding is curator generalization. The trained skill curator outperforms frontier models' zero-shot curation ability and also generalizes across different executor backbones and task domains. The curator-as-module hypothesis is empirically validated: skill curation is a distinct learnable skill, transferable independently of the executor it was trained against.

This pairs structurally with the SkillRL note, "Should successful and failed episodes be processed differently?". Both are RL-for-skill approaches, but along different axes. SkillRL differentiates what gets stored (success demos vs failure lessons). SkillOS differentiates who learns from the storage (curator vs executor). The two are complementary: SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator. Both contribute to the condition-preservation hypothesis: the right architecture for trajectory-based learning preserves applicability conditions through structural choices rather than relying on consolidation correctness.

The architectural implication: agent self-evolution decomposes into at least three separable components: the executor (frozen or rarely retrained), the skill curator (RL-trained), and the skill repository (the artifact the curator edits). The Agentic RL survey's claim that "memory becomes RL-optimizable" extends here to "skill curation becomes RL-optimizable" as a distinct optimizable axis.

SkillOpt arrives at the same frozen-executor / trainable-skill decomposition from the optimization side, and tightens it. Rather than training an RL curation policy, SkillOpt treats the skill document as the external state of a frozen agent and runs a text-space optimizer: a separate optimizer model converts scored rollouts into bounded add/delete/replace edits, gated by a held-out validation score. This is the same curator-executor split, but disciplined like weight-space training (a textual learning rate, a rejected-edit buffer as negative feedback, and an epoch-wise slow/meta update). It also strengthens the cross-harness transfer SkillOS gestures at: across six benchmarks, seven models, and three execution harnesses (direct chat, Codex, Claude Code), a Codex-trained spreadsheet skill transfers to Claude Code for a +59.7 point gain, and the deployed skill adds zero inference-time model calls. In short, SkillOS is an RL-learned curation policy and SkillOpt is validation-gated text optimization: two routes to the same frozen-agent-with-trainable-skills architecture.
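The validation-gated edit loop can be sketched in a few lines. This is an illustrative reading of the description above, not SkillOpt's implementation: the `Edit` type, `apply_edit`, `gated_step`, the `max_edits` cap (a crude stand-in for a bounded edit budget), and the toy validation function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    op: str         # "add", "delete", or "replace"
    index: int      # line index in the document at the time the edit is applied
    text: str = ""  # new text for "add" and "replace"


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a new document with one edit applied; the input is left unchanged."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text)
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def gated_step(
    lines: List[str],
    proposed: List[Edit],
    validate: Callable[[List[str]], float],
    max_edits: int = 2,
    rejected: Optional[List[Edit]] = None,
) -> Tuple[List[str], List[Edit]]:
    """Try at most max_edits edits; keep one only if the validation score improves."""
    rejected = [] if rejected is None else rejected
    best = validate(lines)
    for edit in proposed[:max_edits]:
        candidate = apply_edit(lines, edit)
        score = validate(candidate)
        if score > best:
            lines, best = candidate, score  # accept: held-out score went up
        else:
            rejected.append(edit)           # remember as negative feedback
    return lines, rejected


# Demo with a toy validator that counts lines mentioning verification.
def toy_validate(doc: List[str]) -> float:
    return float(sum("verify" in line.lower() for line in doc))


skill = ["Sum the column."]
edits = [Edit("add", 1, "Verify the header row first."), Edit("delete", 0)]
new_skill, rejected_edits = gated_step(skill, edits, toy_validate)
```

In this toy run the added verification line raises the score and is kept, while the deletion leaves the score unchanged and lands in the rejected buffer. A real optimizer would feed that buffer back to the model proposing edits so that it avoids repeating them.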

Inquiring lines that read this note 55

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How do self-generated feedback mechanisms enable effective model learning?

How can AI agents autonomously learn and transfer skills across tasks?

What drives capability and cost efficiency in agent systems?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

What role does environment diversity play in preventing agents from overfitting to curator imagination?

How does memorization interact with learning and generalization?

Can curated demonstrations compensate for smaller or simpler training environments?

How does AI adoption affect human skill development and labor equality?

Does outsourcing tasks to AI reduce opportunities for skill development?

Why do readers trust citations and complexity regardless of accuracy?

What makes provenance infrastructure more critical than artifact quality?

Can ensemble evaluation methods reduce bias more than single judges?

How do composite rewards attribute curation outcomes to specific skill library changes?

How should agents balance memory condensation to optimize context efficiency?

How should systems govern persistent agent-generated code in shared infrastructure?

Does externalizing cognitive work and state improve agent reliability?

Why do reward structures fail to shape long-term agent learning?

How do you prevent stale reward signals when skills evolve during deployment?

Why does verification consistently lag behind AI generation?

How do skills authored in-loop validate faster than offline generated skills?

What determines success in training models on multiple tasks?

Can extracted skills transfer effectively across different domains and model architectures?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How should memory consolidation strategies shape agent performance over time?

Do harness improvements transfer across model scales or memorize shortcuts?

Can smaller models produce skill updates as useful as frontier model updates?

What coordination failures limit multi-agent LLM systems as they scale?

What breaks when multiple agents share and revise the same artifacts?

What memory abstraction level best enables agent knowledge reuse?

Can workflow memory compound reusable skills into measurable success improvements?

How do prompt structure and constraints affect model instruction reliability?

How can we reorganize repositories to make behaviors easier to locate?

How should retrieval systems optimize for multi-step reasoning during inference?

What makes skills suitable for retrieval and chaining in repositories?

Related concepts in this collection 5

Should successful and failed episodes be processed differently? Explores whether asymmetric treatment of trajectories—preserving successes as full demonstrations while abstracting failures into lessons—could improve both the utility and efficiency of memory in reinforcement learning agents.
SkillRL is the asymmetric-trajectory variant; SkillOS is the curator-decoupling variant; complementary axes of skill-RL design
How does treating LLMs as multi-step agents change what we can optimize? Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
SkillOS is one specific instantiation of the "capabilities become RL-optimizable subsystems" pattern, with skill curation as the optimized capability
Can agents learn reusable sub-task routines from past experience? Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
AWM provides the workflow-extraction mechanism; SkillOS provides the curation-policy training; the two together describe both extraction and selection of skills
Can agents adapt without pausing service to users? Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
MetaClaw decomposes adaptation across timescales; SkillOS decomposes it across roles (curator/executor); both extract subsystems that can be independently optimized
Can skill documents be optimized like neural network weights? Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
synthesizes: same frozen-executor/trainable-skill split reached via text-space optimizer rather than RL curation
