Can agent skills move from prompts to trainable parameters?
This explores whether the procedural know-how an agent uses — its 'skills' — has to live in the prompt as text, or whether it can be baked into the model's weights (or stored elsewhere entirely), and what each choice costs.
The corpus treats this as a live, three-way design tension rather than a settled question. The most direct evidence says yes: skills can be compiled out of the prompt and into the weights. "Can skills work better as weights than as prompts?" describes LatentSkill, a hypernetwork that turns plain-text skills into plug-and-play LoRA adapters, cutting prefill tokens by 64–72% while matching or beating the in-context baselines. The payoff is more than token savings: once a skill is a set of weights, you can do arithmetic on it, adding, scaling, and combining skills as composable vectors in a way that paragraphs of instructions do not allow.
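To make "arithmetic on skills" concrete, here is a minimal sketch, assuming each skill is a low-rank adapter pair (A, B) whose weight update is B @ A. It is not LatentSkill's code: the names `lora_delta` and `combine`, the toy 2x2 matrices, and the scale values are all hypothetical, chosen only to show skills being scaled and summed onto frozen base weights.

```python
# Illustrative only: a "skill" here is a hypothetical LoRA adapter (A, B)
# whose weight update is delta_W = B @ A. This is not LatentSkill's code.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_delta(adapter):
    """Expand a low-rank adapter into its full weight update B @ A."""
    return matmul(adapter["B"], adapter["A"])

def combine(base, adapters, scales):
    """Return base + sum_i scale_i * (B_i @ A_i); `base` itself is not modified."""
    merged = [row[:] for row in base]
    for adapter, scale in zip(adapters, scales):
        delta = lora_delta(adapter)
        for i, row in enumerate(delta):
            for j, value in enumerate(row):
                merged[i][j] += scale * value
    return merged

# Frozen 2x2 base weights and two rank-1 toy adapters (assumed values).
base = [[1.0, 0.0], [0.0, 1.0]]
skill_a = {"B": [[1.0], [0.0]], "A": [[0.0, 2.0]]}  # delta = [[0, 2], [0, 0]]
skill_b = {"B": [[0.0], [1.0]], "A": [[4.0, 0.0]]}  # delta = [[0, 0], [4, 0]]

# Apply skill_a at full strength and skill_b at half strength.
merged = combine(base, [skill_a, skill_b], scales=[1.0, 0.5])
print(merged)
```

Because the base matrix is copied rather than edited, removing a skill is just a matter of recombining without it, which is the property that text instructions lack.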
But moving to weights buys problems the prompt didn't have. The headline risk is forgetting: weight updates tend to overwrite old competence as they install new competence. "Can agents learn new skills without forgetting old ones?" (VOYAGER) makes the opposing bet: keep skills as executable entries in an external, embedding-indexed library and compose complex ones from simpler ones, precisely to dodge the catastrophic forgetting that weight-update methods suffer. "Can agents learn reusable sub-task routines from past experience?" (Agent Workflow Memory) sits in the same camp: it induces reusable sub-task routines from past traces and compounds them hierarchically, reporting relative gains of 24.6% on Mind2Web and 51.1% on WebArena without touching the model's parameters. So 'trainable parameters' is one of at least three storage choices for a skill (prompt text, external library, or weights), each with different forgetting, composability, and cost profiles.
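The external-library option can be sketched as follows, assuming skills are plain callables indexed by an embedding of their description. This is not VOYAGER's implementation: `SkillLibrary`, the skill names, and the bag-of-words `embed` function are hypothetical stand-ins (a real system would use a learned embedding model), and the point is only that new skills are appended and composed while nothing already stored is overwritten.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

class SkillLibrary:
    """Hypothetical append-only store of executable skills."""

    def __init__(self):
        self._skills = {}  # name -> (description embedding, callable)

    def add(self, name, description, fn):
        self._skills[name] = (embed(description), fn)

    def get(self, name):
        return self._skills[name][1]

    def retrieve(self, query):
        """Return (name, callable) of the skill whose description best matches."""
        q = embed(query)
        name = max(self._skills, key=lambda n: cosine(q, self._skills[n][0]))
        return name, self._skills[name][1]

library = SkillLibrary()
library.add("chop_tree", "chop a tree to collect wood", lambda inv: inv + ["wood"])
library.add("craft_planks", "craft planks from wood", lambda inv: inv + ["planks"])

def build_table(inv):
    """A complex skill composed from two simpler stored skills."""
    for step in ("chop_tree", "craft_planks"):
        inv = library.get(step)(inv)
    return inv + ["table"]

library.add("build_table", "build a crafting table", build_table)

name, skill = library.retrieve("how do I build a table")
inventory = skill([])
print(name, inventory)
```

Adding `build_table` leaves `chop_tree` and `craft_planks` untouched and still callable, which is the sense in which a library sidesteps forgetting.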
There's a deeper school that says skills shouldn't live in the model at all. "Where does agent reliability actually come from?" argues that reliability comes from pushing three burdens (memory, skills, and protocols) out of the model and into a surrounding harness, so the model doesn't re-solve the same problems every run. "Can agents learn continuously from experience without updating weights?" pushes this to its limit: AgentFly reaches 87.88% on GAIA validation by doing all policy improvement through memory operations with the LLM's parameters frozen. From this angle, 'move skills into parameters' is almost backwards: the trend is to keep the model frozen and make everything around it learn.
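A minimal sketch of "the model stays frozen, the memory learns", assuming a single case memory keyed by exact task string. It is not AgentFly's algorithm, which uses a Memory-augmented MDP with case, subtask, and tool memories; `frozen_policy`, `CaseMemory`, the task text, and the action names are hypothetical and exist only to show behavior changing through memory writes alone.

```python
def frozen_policy(task):
    """Stand-in for a frozen LLM: it always proposes the same default action."""
    return "search_web"

class CaseMemory:
    """Hypothetical store of (task, action, reward) cases."""

    def __init__(self):
        self.cases = []

    def write(self, task, action, reward):
        self.cases.append((task, action, reward))

    def best_action(self, task):
        """Return the highest-reward action seen for this exact task, if it succeeded."""
        matches = [(reward, action) for t, action, reward in self.cases if t == task]
        if not matches:
            return None
        reward, action = max(matches)
        return action if reward > 0 else None

def act(task, memory):
    """Prefer a remembered successful action; otherwise fall back to the frozen policy."""
    return memory.best_action(task) or frozen_policy(task)

memory = CaseMemory()
task = "convert currency"

first = act(task, memory)                    # no cases yet, so the default is used
memory.write(task, "search_web", 0.0)        # the default failed
memory.write(task, "call_calculator", 1.0)   # an explored alternative succeeded
second = act(task, memory)                   # behavior changes; frozen_policy did not

print(first, second)
```

Exact-match lookup is a deliberate simplification; a realistic memory would retrieve similar cases rather than identical ones.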
If you do train skills into a model, the corpus warns that the bottleneck is rarely the weights themselves. "Can agents learn beyond what their training data shows?" shows that skills learned only from static expert demonstrations stay capped at what the dataset's curators imagined, because the agent never interacts with a live environment during training and so never learns from its own failures. "Can you turn an LLM into an agent by just fine-tuning?" makes the same point structurally: turning an LLM into an action-taker takes more than fine-tuning. It needs dataset curation, action grounding, harness integration, and safety evaluation as distinct stages. And "Can delegation teach models to manage context more actively?" (SearchSwarm) hints that some skills genuinely *do* belong in weights: training a model to delegate produced a transferable discipline that carried over to single-agent tasks, which suggests a skill trained into parameters can generalize in ways a pasted-in prompt routine may not.
The underlying point is that the prompt-vs-parameter question is really a question about *where the learning happens*. Weights give you composability and token savings but risk forgetting and tie you to what your training data's curators imagined; external libraries and memory give you lifelong accumulation and a frozen, debuggable model but lean on a heavier harness. The frontier work, on this reading, is less about picking a side than about combining them: compiling skills into LoRA adapters you can do math on while keeping the base model frozen, so that installing a new skill does not overwrite old ones.
Sources (8 notes)
LatentSkill uses a hypernetwork to convert textual agent skills into plug-and-play LoRA adapters, reducing prefill tokens by 64–72% while maintaining or beating in-context baselines. Weight-space skills form composable semantic structures that can be scaled and combined through parameter arithmetic.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
Reliable LLM agents externalize three cognitive burdens (memory for state persistence, skills as procedural components, and protocols for structured interaction) into a harness layer rather than relying on model scale alone. The harness unifies these externalized components and removes the need for the model to solve the same problems repeatedly.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Converting LLMs to action-capable systems requires four distinct stages: curating action-environment-user datasets, training for action grounding, integrating agent infrastructure with memory and tools, and rigorous safety evaluation. The surrounding system and harness determine whether actions are grounded or hallucinated.
SearchSwarm shows that training models to delegate subtasks and integrate summarized results beats passive compression, with a 30B model matching much larger ones. Critically, the delegation skill transfers to single-agent tasks, suggesting it teaches disciplined decomposition and evidence grounding, not just orchestration.