Can a separate trained curator improve skill libraries better than frozen agents?
Explores whether decoupling skill curation from agent execution enables better long-term learning of what skills to keep, delete, or refine. Matters because manual curation doesn't scale and heuristic approaches lack feedback.
Reusable skills distilled from experience provide a natural substrate for self-evolving agents. The bottleneck is not whether to maintain a skill library but who curates it well. Manual curation demands expertise that does not scale to task diversity. Heuristic/prompting-based curation lacks downstream feedback. Existing RL approaches train short-horizon skill operations and miss the long-term curation policy needed for skill update and deletion.
SkillOS (2605.06614) makes two architectural decisions that together produce a surprising result. First, it decouples the trainable skill curator from the agent executor: the executor stays frozen and only retrieves and applies skills, while a separate trainable curator updates the SkillRepo from accumulated experience. This makes the curator a modular component that can be optimized without retraining the underlying agent.
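The curator/executor split can be pictured with a minimal sketch. Everything here is a hypothetical illustration, not SkillOS's implementation: the class names (`FrozenExecutor`, `Curator`, `Trajectory`), the substring-based retrieval, and the hand-written update rule are all assumptions standing in for a real agent and a learned curation policy. The only point it shows is the ownership boundary: the executor reads the repository, and only the curator writes to it.

```python
from dataclasses import dataclass, field


@dataclass
class SkillRepo:
    """Shared artifact: skill name -> skill text."""
    skills: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class Trajectory:
    task: str
    used_skills: tuple[str, ...]
    success: bool


class FrozenExecutor:
    """Reads the repo but never writes to it and is never trained."""

    def run(self, task: str, repo: SkillRepo) -> Trajectory:
        # Toy retrieval (assumption): a skill is relevant if its name appears in the task.
        used = tuple(sorted(name for name in repo.skills if name in task))
        # Toy outcome (assumption): the task succeeds only if some skill was retrieved.
        return Trajectory(task=task, used_skills=used, success=bool(used))


class Curator:
    """The only component that modifies the repo, and the only one that would be trained."""

    def update(self, repo: SkillRepo, traj: Trajectory) -> None:
        # Placeholder rule standing in for a learned policy:
        # on failure, add a skill keyed by the first word of the task.
        if not traj.success:
            key = traj.task.split()[0]
            repo.skills[key] = f"Notes on how to handle '{key}' tasks."
```

In this toy setup, running the executor on "spreadsheet merge cells" with an empty repository fails, the curator then adds a "spreadsheet" skill, and a later "spreadsheet sort rows" task retrieves it. Swapping in a different executor would not require touching the curator, which is the modularity the note describes.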
Second, it groups related tasks into training streams to provide long-horizon learning signals. Earlier trajectories update the SkillRepo; later related tasks evaluate those updates. The grouping exploits skill-relevant task dependencies: what was learned on one task is tested on adjacent tasks. Composite rewards combine downstream executor feedback with intermediate signals to attribute outcomes to specific curation decisions.
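A rough sketch of the two ideas in this paragraph, under stated assumptions: the grouping key, the validity check used as the intermediate signal, and the weights `w_downstream` and `w_intermediate` are all hypothetical choices made for illustration. The source does not specify how SkillOS defines task relatedness or how it weights reward terms.

```python
from collections import defaultdict


def group_into_streams(tasks, key):
    """Group tasks that share a key (a stand-in for 'related') into ordered streams."""
    streams = defaultdict(list)
    for task in tasks:
        streams[key(task)].append(task)
    return dict(streams)


def composite_reward(later_successes, edit_was_valid,
                     w_downstream=1.0, w_intermediate=0.25):
    """Blend the downstream success rate on later tasks with an intermediate signal."""
    # Downstream term: how often the frozen executor succeeded on later, related tasks.
    downstream = sum(later_successes) / len(later_successes) if later_successes else 0.0
    # Intermediate term (assumption): whether the curation edit itself was well-formed.
    intermediate = 1.0 if edit_was_valid else 0.0
    return w_downstream * downstream + w_intermediate * intermediate
```

For example, grouping `["sql: join", "web: login", "sql: group by"]` by the prefix before the colon yields one stream of two SQL tasks and one stream with the web task. A curation decision made after the first SQL task would then be scored by what happens on the second.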
The surprising result is what the skill repository evolves into. Early in training, the curator introduces generic sections — additional guidance, tips, recommendations — that make skills more verbose without operational improvement. As training progresses, the additions shift toward actionable structures: failure-handling logic, conditional branches specifying when to deviate from defaults. Even more notably, the global organization evolves: early repositories contain narrow task-specific skills, later repositories contain meta-strategy skills covering verification, fallback planning, system search, and strategy adjustment. The curator does not merely accumulate skills — it progressively expands the repository's strategic space toward compositional cross-task control knowledge.
The most consequential downstream finding is curator generalization. The trained skill curator outperforms frontier models' zero-shot curation ability and generalizes across different executor backbones and task domains. The curator-as-module hypothesis is empirically validated: skill curation is a distinct learnable skill, transferable independently of the executor it was trained against.
This pairs structurally with the note "Should successful and failed episodes be processed differently?" (SkillRL): both are RL-for-skill approaches, but along different axes. SkillRL differentiates what gets stored (success demos vs failure lessons). SkillOS differentiates who learns from the storage (curator vs executor). The two are complementary: SkillRL's asymmetric trajectory processing is a candidate ingredient inside SkillOS's curator. Both contribute to the condition-preservation hypothesis: the right architecture for trajectory-based learning preserves applicability conditions through structural choices rather than relying on consolidation correctness.
The architectural implication is that agent self-evolution decomposes into at least three components: a rarely retrained executor, an RL-trained skill curator, and the skill repository as the evolving artifact. The Agentic RL survey's claim that "memory becomes RL-optimizable" extends here to "skill curation becomes RL-optimizable" as a distinct optimizable axis.
SkillOpt arrives at the same frozen-executor / trainable-skill decomposition from the optimization side, and tightens it. Rather than an RL curation policy, SkillOpt treats the skill document as the external state of a frozen agent and runs a text-space optimizer: a separate optimizer model converts scored rollouts into bounded add/delete/replace edits, gated by a held-out validation score — the same curator-executor split, but disciplined like weight-space training (textual learning rate, rejected-edit buffer as negative feedback, epoch-wise slow/meta update). It also strengthens the cross-harness transfer SkillOS gestures at: across six benchmarks, seven models, and three execution harnesses (direct chat, Codex, Claude Code), a Codex-trained spreadsheet skill transfers to Claude Code for a +59.7 point gain, and the deployed skill adds zero inference-time model calls. SkillOS = RL-learned curation policy; SkillOpt = validation-gated text optimization — two routes to the same frozen-agent-with-trainable-skills architecture.
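The validation-gated edit loop described for SkillOpt can be sketched as follows. This is an illustrative toy, not SkillOpt's code: the `Edit` type, the `gated_step` function, and the use of a `max_edits` cap as one possible stand-in for the "textual learning rate" are assumptions, and `validate` is any caller-supplied scoring function on held-out tasks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edit:
    op: str                     # "add", "delete", or "replace"
    target: Optional[str]       # existing line to delete or replace (None for add)
    text: Optional[str] = None  # new line for add or replace


def apply_edit(skill: list[str], edit: Edit) -> list[str]:
    """Return a new skill document with one edit applied; the input is not mutated."""
    if edit.op == "add":
        return skill + [edit.text]
    if edit.op == "delete":
        return [line for line in skill if line != edit.target]
    if edit.op == "replace":
        return [edit.text if line == edit.target else line for line in skill]
    raise ValueError(f"unknown op: {edit.op}")


def gated_step(skill, edits, validate, max_edits=2):
    """Try at most max_edits edits, keeping each only if the validation score improves."""
    best, rejected = validate(skill), []
    for edit in edits[:max_edits]:
        candidate = apply_edit(skill, edit)
        score = validate(candidate)
        if score > best:
            skill, best = candidate, score
        else:
            # Rejected edits are kept so they can be fed back as negative examples.
            rejected.append(edit)
    return skill, best, rejected
```

Because the accepted skill document is plain text handed to the frozen agent, nothing in this loop runs at deployment time, which is consistent with the note's point that the deployed skill adds no inference-time model calls.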
Inquiring lines that use this note as a source (37)
This note is a source for these synthesized inquiries.
- Does extended exoskeleton use eventually produce meaningful skill transfer?
- Which AI interaction patterns preserve learning while which ones degrade skill formation?
- When should you optimize agent behavior versus tool performance separately?
- Can tool adaptation work without freezing the agent in the loop?
- What role does environment diversity play in preventing agents from overfitting to curator imagination?
- Can curated demonstrations compensate for smaller or simpler training environments?
- Can agentic reasoning outperform rigid rule-based systems for skill refinement?
- How much does agent performance depend on demonstration quantity versus curation quality?
- Does outsourcing tasks to AI reduce opportunities for skill development?
- What infrastructure decouples generation from training in asynchronous agent loops?
- Can a static evaluator become the performance ceiling for an improving actor?
- Should agent capability be optimized separately from general capability?
- Can agentic AI tools deliver productivity gains on learning tasks differently?
- What makes provenance infrastructure more critical than artifact quality?
- How do task stream groupings provide long-horizon learning signals for curation decisions?
- Can curator modules trained on one executor transfer to entirely different agent backbones?
- How do composite rewards attribute curation outcomes to specific skill library changes?
- Should agents continuously prune irrelevant links during execution?
- How do agents decide which created code should persist versus disappear?
- Can individual skills improve through reuse and accumulate experience across tasks?
- Do learned workflows transfer between different agents with minimal accuracy loss?
- How do agents automatically generate suitable learning tasks based on current capability?
- Can skill validation through testing prevent unreliable programs from accumulating?
- How do you prevent stale reward signals when skills evolve during deployment?
- Can skill libraries prevent redundant narrow artifacts from proliferating?
- How do skills authored in-loop validate faster than offline generated skills?
- What lifecycle management prevents in-loop skill creation from bloating an agent?
- Does self-play feedback improve skills created from the agent's own experience?
- What training method supports dynamic tool discovery in long-horizon agents?
- Can extracted skills transfer effectively across different domains and model architectures?
- Why does decomposition ability transfer across domains but solving ability does not?
- Why do agents systematically underuse condensed experience in skill documents?
- What makes memory curation harder to solve than simply expanding storage?
- What makes persistent, shared code artifacts from agents hard to manage at scale?
- What specific bookkeeping tasks can environments maintain more reliably than policies?
- Should we train the evolver or the executor when building self-improving agents?
- Can smaller models produce skill updates as useful as frontier model updates?
Related concepts in this collection (5)
- Should successful and failed episodes be processed differently?
  Explores whether asymmetric treatment of trajectories (preserving successes as full demonstrations while abstracting failures into lessons) could improve both the utility and efficiency of memory in reinforcement learning agents.
  SkillRL is the asymmetric-trajectory variant; SkillOS is the curator-decoupling variant; complementary axes of skill-RL design
- How does treating LLMs as multi-step agents change what we can optimize?
  Instead of optimizing single prompt-response pairs, what happens when we model LLM agents as temporally-extended decision processes? The question matters because it shifts what becomes trainable.
  SkillOS is one specific instantiation of the "capabilities become RL-optimizable subsystems" pattern, with skill curation as the optimized capability
- Can agents learn reusable sub-task routines from past experience?
  Do web agents fail at long-horizon tasks because they cannot extract and reuse workflows shared across similar problems? This explores whether sub-task abstraction enables skill accumulation rather than task-by-task problem solving.
  AWM provides the workflow-extraction mechanism; SkillOS provides the curation-policy training; the two together describe both extraction and selection of skills
- Can agents adapt without pausing service to users?
  Can deployed LLM agents continuously improve their capabilities while serving users without interruption? This explores whether fast behavioral updates and slow policy learning can coexist across different timescales.
  MetaClaw decomposes adaptation across timescales; SkillOS decomposes it across roles (curator/executor); both extract subsystems that can be independently optimized
- Can skill documents be optimized like neural network weights?
  Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?
  synthesizes: same frozen-executor/trainable-skill split reached via text-space optimizer rather than RL curation
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- SkillOS: Learning Skill Curation for Self-Evolving Agents
- SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
- MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
- MetaClaw: Just Talk — An Agent That Meta-Learns and Evolves in the Wild
- SkillOpt: Executive Strategy for Self-Evolving Agent Skills
- Adaptation of Agentic AI
- Hyperagents
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Original note title
RL-trained skill curation decoupled from frozen executor produces repositories that evolve from generic guidance toward execution-oriented refinement and meta-strategy skills