How much can externalized skills improve models before hitting diminishing returns?
This explores whether storing skills *outside* a model's weights — in a library it can call and recombine, the way VOYAGER does — escapes the ceilings that in-weight training keeps running into, and where its own limits show up.
This reads the question as a contrast between two ways of making a model better: changing its weights versus keeping skills *external*, in a searchable library the agent writes to, indexes, and recombines. That distinction matters, because the corpus is surprisingly consistent that the in-weight path hits a wall, and externalization is interesting precisely because its limits lie somewhere else.
Start with where in-weight improvement stalls. Reinforcement learning on verifiable rewards mostly *activates* strategies the base model already learned during pretraining rather than teaching genuinely new ones: a single training example can trigger the gain, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones (What does reward learning actually do to model reasoning?). Imitating a stronger model is worse than it looks: it copies the confident style without closing the capability gap, and the ceiling is set by the base model's fundamentals (Can imitating ChatGPT fool evaluators into thinking models improved?). And pure self-improvement is structurally circular. It stalls on the gap between generating answers and verifying them, and the methods that do work succeed only by smuggling in an *external* signal: a judge, a tool, a past version, a user correction (Can models reliably improve themselves without external feedback?; What limits how much models can improve themselves?). So the headline answer to "how much can you improve" from weights alone is: not far beyond what the base model could already do, unless an outside anchor is feeding in a signal.
Externalized skills are compelling because they sit *outside* the weights, so they don't compete for the same finite capacity. A skill library that stores executable routines and composes complex skills from simpler ones lets an agent keep learning without the catastrophic forgetting that weight updates cause: old skills stay intact because they're stored, not overwritten (Can agents learn new skills without forgetting old ones?). This is the closest thing in the corpus to *non*-diminishing returns: composition compounds, and an automatic curriculum keeps generating new things to learn. The library grows while the weights stay as they are.
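To make the store-retrieve-compose idea concrete, here is a minimal sketch of a skill library. Everything in it is hypothetical and for illustration only: the `SkillLibrary` class, the toy skills, and especially `embed`, which is a bag-of-words stand-in for a real embedding model. It is not VOYAGER's implementation.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

State = dict
Skill = Callable[[State], State]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts (assumption)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SkillLibrary:
    """Hypothetical library: skills are stored, never overwritten."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}
        self._index: Dict[str, Counter] = {}

    def add(self, name: str, description: str, fn: Skill) -> None:
        if name in self._skills:
            raise ValueError(f"skill {name!r} already exists")
        self._skills[name] = fn
        self._index[name] = embed(description)

    def retrieve(self, query: str, k: int = 1) -> List[str]:
        """Return the names of the k skills whose descriptions best match."""
        q = embed(query)
        ranked = sorted(
            self._index, key=lambda n: cosine(q, self._index[n]), reverse=True
        )
        return ranked[:k]

    def compose(self, name: str, description: str, parts: List[str]) -> None:
        """Register a new skill that runs existing skills in sequence."""
        fns = [self._skills[p] for p in parts]

        def composed(state: State) -> State:
            for fn in fns:
                state = fn(state)
            return state

        self.add(name, description, composed)

    def run(self, name: str, state: State) -> State:
        return self._skills[name](state)


def chop_wood(state: State) -> State:
    return {**state, "wood": state.get("wood", 0) + 1}


def craft_planks(state: State) -> State:
    if state.get("wood", 0) < 1:
        raise RuntimeError("need wood to craft planks")
    return {**state, "wood": state["wood"] - 1, "planks": state.get("planks", 0) + 4}


lib = SkillLibrary()
lib.add("chop_wood", "chop a tree to collect wood", chop_wood)
lib.add("craft_planks", "craft planks from wood", craft_planks)
lib.compose("make_planks", "chop wood then craft planks from it",
            ["chop_wood", "craft_planks"])

result = lib.run("make_planks", {})
best_match = lib.retrieve("how do I craft planks")[0]
```

Adding `make_planks` leaves `chop_wood` and `craft_planks` untouched, which is the sense in which stored skills avoid forgetting. Note that `retrieve` still depends on the quality of the embedding and of the query the model writes, so the choice of which skill to use remains a model-side capability.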
But the diminishing-returns story doesn't vanish; it relocates. Externalized skills only compound as fast as the environment can *verify* them, and that feedback loop is exactly the generation-verification gap that bounds everything else (What limits how much models can improve themselves?). The model still has to recognize, retrieve, and correctly compose its own stored skills, a capability that lives in the weights and is itself capped by the base model. There's a quieter limiter too: weight-update methods that drift far from the base distribution lose *plasticity*, the very ability to keep learning, while staying close (low KL drift) preserves it (Does staying close to the base model preserve learning ability?). Externalization helps here precisely by minimizing how much you touch the weights at all.
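The claim that a library only grows as fast as skills can be verified can be sketched as an admission gate. This is an illustrative assumption about how such a gate might look, not a described implementation: `verify`, `admit_verified`, and the toy integer skills are all hypothetical, and the "environment" is reduced to a list of input/expected-output checks.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[int], int]
Check = Tuple[int, int]  # (input, expected output) supplied by the environment


def verify(skill: Skill, checks: List[Check]) -> bool:
    """A skill passes only if the environment can check it and every check holds."""
    if not checks:
        return False  # nothing to verify against: treat as unverified
    for x, expected in checks:
        try:
            if skill(x) != expected:
                return False
        except Exception:
            return False
    return True


def admit_verified(
    library: Dict[str, Skill],
    candidates: Dict[str, Skill],
    checks: Dict[str, List[Check]],
) -> List[str]:
    """Add verified candidates to the library; return the names that were rejected."""
    rejected = []
    for name, skill in candidates.items():
        if verify(skill, checks.get(name, [])):
            library[name] = skill
        else:
            rejected.append(name)
    return rejected


library: Dict[str, Skill] = {}
candidates: Dict[str, Skill] = {
    "double": lambda x: 2 * x,   # correct
    "square": lambda x: x * 2,   # buggy: doubles instead of squaring
    "negate": lambda x: -x,      # correct, but the environment has no check for it
}
checks: Dict[str, List[Check]] = {
    "double": [(2, 4), (5, 10)],
    "square": [(3, 9)],
}

rejected = admit_verified(library, candidates, checks)
```

Three skills were generated but only one was admitted. The correct `negate` skill is lost for the same reason the buggy `square` is: growth is bounded by what can be checked, not by what can be written.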
The thing you might not have expected: the ceiling isn't really about "how many skills you can store", since storage is cheap and composition compounds. It's about *judgment*. More is not monotonically better even at inference time: pile on thinking tokens and accuracy peaks and then falls, because models overthink easy problems and underthink hard ones (Does more thinking time always improve reasoning accuracy?), and pushing on overly hard material breeds degenerate shortcuts that contaminate skills the model already had (Do overly hard RLVR samples actually harm model capabilities?). Externalized skills plausibly push the ceiling further than weight-tuning alone, but the binding constraint becomes the model's ability to tell which stored skill to reach for and when to stop, which is a verification problem rather than a capacity one.
Sources (8 notes)
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.
VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.
FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
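The group-relative normalization effect in the last note can be worked through numerically. The sketch below assumes the common form of the normalization, advantage = (reward - group mean) / group standard deviation, using the population standard deviation and no epsilon term; real implementations may differ in those details, and the function name is hypothetical.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize each reward against its own sampling group (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all samples equal: no learning signal
    return [(r - mean) / std for r in rewards]


# Nearly impossible problem: 1 accidental success in a group of 8.
rare = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# Moderately hard problem: 4 successes in a group of 8.
common = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

print(f"advantage of the lone success: {rare[0]:.3f}")
print(f"advantage of one of four successes: {common[0]:.3f}")
```

For binary rewards with success rate p, the advantage of a success under this form is sqrt((1 - p) / p), so it grows as successes get rarer: about 2.646 at p = 1/8 versus 1.0 at p = 1/2. A lucky trajectory on a nearly impossible problem is therefore reinforced more strongly than a typical success on a tractable one, whatever shortcut produced it.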