INQUIRING LINE

How much can externalized skills improve models before hitting diminishing returns?

This explores whether storing skills *outside* a model's weights — in a library it can call and recombine, the way VOYAGER does — escapes the ceilings that in-weight training keeps running into, and where its own limits show up.


This reads the question as a contrast between two ways of making a model better: changing its weights versus keeping skills *external* — in a searchable library the agent writes to, indexes, and recombines. That distinction matters, because the corpus is surprisingly consistent that the in-weight path hits a wall, and externalization is interesting precisely because it dodges a different one.

Start with where in-weight improvement stalls. Reinforcement learning with verifiable rewards (RLVR) mostly *activates* strategies the base model already learned during pretraining rather than teaching genuinely new ones: a single example can trigger the gain, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones (see "What does reward learning actually do to model reasoning?"). Imitating a stronger model is worse than it looks: it copies confident style while closing no real capability gap, with the ceiling fixed by the base model's fundamentals (see "Can imitating ChatGPT fool evaluators into thinking models improved?"). And pure self-improvement is structurally circular. It stalls on the gap between generating answers and verifying them, and the methods that actually work succeed only by smuggling in an *external* signal: a judge, a tool, a past version, a user correction (see "Can models reliably improve themselves without external feedback?" and "What limits how much models can improve themselves?"). So the headline answer to "how much can you improve" from weights alone is: not far beyond what the base model could already do, unless an outside anchor is feeding in.

Externalized skills are compelling because they sit *outside* the weights, so they don't compete for the same finite capacity. A skill library that stores executable routines and composes complex skills from simpler ones lets an agent keep learning without the catastrophic forgetting that weight updates cause: old skills stay intact because they're stored, not overwritten (see "Can agents learn new skills without forgetting old ones?"). This is the closest thing in the corpus to *non*-diminishing returns: composition compounds, and an automatic curriculum keeps generating new things to learn. The library grows; the model doesn't have to shrink to fit it.
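
To make the store, retrieve, and compose loop concrete, here is a minimal sketch of a skill library. It is not VOYAGER's implementation: the `SkillLibrary` class, the toy inventory skills, and the bag-of-words `embed` function (a stand-in for a real embedding model) are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, int]  # hypothetical inventory, e.g. {"wood": 1}


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class Skill:
    name: str
    description: str
    run: Callable[[State], State]


class SkillLibrary:
    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def add(self, skill: Skill) -> None:
        # Refuse to overwrite: existing skills are never modified by new ones.
        if skill.name in self._skills:
            raise ValueError(f"skill {skill.name!r} already stored")
        self._skills[skill.name] = skill

    def retrieve(self, query: str, k: int = 1) -> List[Skill]:
        # Rank stored skills by similarity between query and description.
        q = embed(query)
        ranked = sorted(
            self._skills.values(),
            key=lambda s: cosine(q, embed(s.description)),
            reverse=True,
        )
        return ranked[:k]

    def compose(self, name: str, description: str, parts: List[str]) -> Skill:
        # Build a new skill that runs existing skills in sequence.
        steps = [self._skills[p] for p in parts]

        def run(state: State) -> State:
            for step in steps:
                state = step.run(state)
            return state

        skill = Skill(name, description, run)
        self.add(skill)
        return skill


def chop_wood(state: State) -> State:
    new = dict(state)
    new["wood"] = new.get("wood", 0) + 1
    return new


def craft_planks(state: State) -> State:
    new = dict(state)
    if new.get("wood", 0) < 1:
        raise RuntimeError("craft_planks needs at least one wood")
    new["wood"] -= 1
    new["planks"] = new.get("planks", 0) + 4
    return new


library = SkillLibrary()
library.add(Skill("chop_wood", "chop a tree to collect wood", chop_wood))
library.add(Skill("craft_planks", "craft planks from wood", craft_planks))
library.compose(
    "make_planks", "collect wood then craft planks", ["chop_wood", "craft_planks"]
)

best = library.retrieve("craft planks from wood")[0]
result = library.retrieve("collect wood then craft planks")[0].run({})
```

Note that growing the library is cheap in this sketch, while `retrieve` is where a wrong choice would surface: the quality of the ranking, not the number of stored entries, decides whether the right skill is used.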

But the diminishing-returns story doesn't vanish; it relocates. Externalized skills only compound as fast as the environment can *verify* them, and that feedback loop is exactly the generation-verification gap that bounds everything else (see "What limits how much models can improve themselves?"). The model still has to recognize, retrieve, and correctly compose its own stored skills, a capability that lives in the weights and is itself capped by the base model. There's a quieter limiter too: weight-update methods that drift far from the base distribution lose *plasticity*, the very ability to keep learning, while staying close (low KL drift) preserves it (see "Does staying close to the base model preserve learning ability?"). Externalization helps here precisely by minimizing how much you touch the weights at all.
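
A small sketch can show why verification, rather than storage, bounds what a library accumulates. Everything here is hypothetical: the `admit_verified` helper, the toy "double a number" task, and the two checkers are illustrative assumptions, not a method from the cited notes.

```python
from typing import Callable, Dict, List, Tuple

Skill = Callable[[int], int]
Verifier = Callable[[Skill], bool]


def admit_verified(
    candidates: Dict[str, Skill], verify: Verifier
) -> Tuple[Dict[str, Skill], List[str]]:
    """Store only candidates the verifier accepts; return (library, rejected)."""
    library: Dict[str, Skill] = {}
    rejected: List[str] = []
    for name, skill in candidates.items():
        if verify(skill):
            library[name] = skill
        else:
            rejected.append(name)
    return library, rejected


def strict_check(skill: Skill) -> bool:
    # Hypothetical environment feedback: test several inputs.
    return all(skill(x) == 2 * x for x in (0, 1, 5, -3))


def weak_check(skill: Skill) -> bool:
    # A weaker verifier that happens to test only inputs where x * x == 2 * x.
    return all(skill(x) == 2 * x for x in (0, 2))


candidates: Dict[str, Skill] = {
    "double_by_add": lambda x: x + x,
    "double_by_shift": lambda x: x << 1,
    "double_wrong": lambda x: x * x,  # squares instead of doubling
}

strict_library, strict_rejected = admit_verified(candidates, strict_check)
weak_library, weak_rejected = admit_verified(candidates, weak_check)
```

With the strict checker, the faulty `double_wrong` candidate is rejected; with the weak checker it is admitted and would be available for later composition. In this toy setting the library is only as trustworthy as the check that gates it.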

The thing you might not have expected: the ceiling isn't really about "how many skills you can store", since storage is cheap and composition compounds. It's about *judgment*. More is not monotonically better even at inference time: pile on thinking tokens and accuracy peaks then falls, because models overthink easy problems and underthink hard ones (see "Does more thinking time always improve reasoning accuracy?"), and training on overly hard material breeds degenerate shortcuts that contaminate skills the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). Externalized skills push the ceiling much further than weight-tuning, but the binding constraint becomes the model's ability to tell which stored skill to reach for and when to stop, which is a verification problem, not a capacity one.


Sources (8 notes)

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
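
The effect of group-relative normalization described in the last note can be worked through numerically. The sketch below assumes the common form in which each reward is centered on the group mean and divided by the group standard deviation; the population standard deviation is used here, and real implementations may differ (for example by using the sample standard deviation or adding a small epsilon). The function name and reward values are illustrative.

```python
import math
import statistics
from typing import List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Center rewards on the group mean and scale by the group std (assumed form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    if std == 0:
        # All rollouts scored the same, so there is no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


# Nearly impossible problem: 1 accidental success out of 16 rollouts.
hard = group_relative_advantages([1.0] + [0.0] * 15)

# Moderately hard problem: 8 successes out of 16 rollouts.
moderate = group_relative_advantages([1.0] * 8 + [0.0] * 8)

lucky_advantage = hard[0]         # advantage of the single lucky rollout
typical_advantage = moderate[0]   # advantage of a success on the moderate problem
```

Under these assumptions, a single success among G binary-reward rollouts receives an advantage of sqrt(G - 1), about 3.87 for G = 16, while a success on the half-solved problem receives 1.0. The rarer the success, the harder that one trajectory is pushed, whether or not its reasoning was sound.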

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing claims about externalized skills and diminishing returns in LLM improvement. The question remains open: *How much can externalized skills improve models before hitting diminishing returns?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable until re-grounded.
- In-weight improvement (RL, imitation, self-improvement) stalls near base model capacity; single examples trigger gains; spurious rewards work nearly as well (~2023–2024).
- Pure self-improvement is circular, bottlenecked by the generation-verification gap (~2024–2025).
- Externalized skill libraries avoid catastrophic forgetting and enable composition without weight-update tradeoffs (~2026).
- Reasoning accuracy degrades beyond a critical thinking-token threshold; models overthink easy problems and underthink hard ones (~2025).
- Hard RL samples induce degenerate shortcuts that contaminate existing skills (~2025).
- Model plasticity (ability to keep learning) depends on low KL drift from base; externalization preserves it by minimizing weight touches (~2026).
- The binding constraint shifts from storage to *judgment*: knowing which skill to retrieve and when to stop (~2026).

Anchor papers (verify; mind their dates):
- arXiv:2305.15717 (2023) — False Promise of Imitating Proprietary LLMs
- arXiv:2412.02674 (2024) — Mind the Gap: Self-Improvement Capabilities
- arXiv:2506.04210 (2025) — Does Thinking More Always Help?
- arXiv:2604.08377 (2026) — SkillClaw: Collective Skill Evolution

Your task:
(1) RE-TEST EACH CONSTRAINT. For in-weight methods (RL, imitation, self-improvement), have newer model scales, training regimes, or verification tooling since RELAXED the generation-verification bottleneck? For externalized skills, have methods since emerged that *do* scale monotonically with library size, or does judgment remain the binding constraint? Separate the durable question (skill composition + verification tradeoffs) from perishable limitations (current RL instability, plasticity loss at high KL).
(2) Surface the strongest contradicting or superseding work from the last ~6 months: any papers claiming in-weight scaling *does* work, or that externalization hits *storage*-bound limits, not judgment-bound ones.
(3) Propose 2 research questions that assume the regime has moved: (a) If judgment is the bottleneck, what meta-learning or retrieval architecture *does* solve it? (b) Can multi-agent or ensemble externalization distribute judgment across agents, dodging the single-model ceiling?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
