How should we allocate model budget between evolvers and harness users?
This explores how to split a fixed model budget between the work of *evolving* a harness (writing the protocols, skills, and memory edits) and the work of *using* that harness to get tasks done. The corpus has a surprisingly sharp answer to the first half: the capacity to produce useful harness updates is roughly flat across model tiers, while the capacity to *benefit* from those updates follows an inverted U, peaking at mid-tier models (see "Do stronger models always evolve harnesses better?"). Weak models can't reliably invoke the harness they're handed; very strong models chafe against faithfully following externalized instructions. That single finding reframes the whole budgeting question: if evolving is flat but benefiting is peaked, you don't need to spend your best model on writing the harness.
The practical implication is a deliberate asymmetry: spend your premium tokens on the *users*, not the *evolvers*. Since any tier can draft a competent harness edit, the evolver role is a place to economize: a cheaper or smaller model can generate protocol and skill updates without much loss. The expensive, high-value compute belongs where the inverted U peaks, with the mid-tier agents actually executing tasks against the harness. The diversity literature reaches the same logic from another direction: for synthetic data generation, models around 500M parameters produce more unique outputs per sample than larger ones, because big models concentrate probability mass and collapse variety (see "Why aren't bigger models better for generating diverse outputs?"). If part of evolving a harness is proposing many candidate edits to select from, smaller generators may explore better per dollar.
There's a deeper warning lurking here, though: don't let the evolvers run in a closed loop. Pure self-improvement, where a model rewrites its own harness based on its own judgment, hits structural limits from the generation-verification gap, diversity collapse, and reward hacking; the methods that actually work smuggle in external anchors like past model versions, third-party judges, user corrections, or tool feedback (see "Can models reliably improve themselves without external feedback?"). So a slice of budget should go not to *more* evolver compute but to *verifier* compute. Reward models themselves improve markedly when allowed to reason before scoring, which turns evaluation into its own test-time-scaling axis (see "Can reward models benefit from reasoning before scoring?"). The evolver/user split is really a three-way split: generate, use, and verify.
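The generate-then-verify step can be sketched in a few lines. This is a minimal illustration, not a method taken from the source notes: `select_edit`, `ScoredEdit`, and the `external_verifier` callable are hypothetical names, and the verifier stands in for whatever external anchor is available (a third-party judge, a past version's score, user corrections). The one design point it shows is that a candidate edit is accepted only when a scorer *other than the generator* rates it above the current baseline.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class ScoredEdit:
    edit: str      # a candidate harness edit (hypothetical representation)
    score: float   # score assigned by the external verifier


def select_edit(
    candidates: Sequence[str],
    external_verifier: Callable[[str], float],
    baseline_score: float,
) -> Optional[ScoredEdit]:
    """Return the best-scoring candidate, or None if nothing beats the baseline.

    `external_verifier` is an assumption: any scorer that is independent of
    the model that generated the candidates.
    """
    if not candidates:
        return None
    scored = [ScoredEdit(c, external_verifier(c)) for c in candidates]
    best = max(scored, key=lambda s: s.score)
    # Reject the whole batch rather than accept an edit that does not improve
    # on the harness already in use.
    return best if best.score > baseline_score else None
```

Returning `None` when no candidate clears the baseline keeps a cheap, diverse pool of evolvers from degrading the harness: unverified edits are simply dropped.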
Stepping back, the corpus reframes "allocation" itself as adaptive rather than fixed. The compute-optimal scaling work shows that spending the *same* total budget adaptively (less on easy prompts, more on hard ones) beats a uniform split and can even beat a larger model run under a uniform budget (see "Can we allocate inference compute based on prompt difficulty?" and "How should we allocate compute budget at inference time?"). Applied here, that means don't fund evolvers and users at a static ratio; fund harness evolution when the agents are visibly struggling (hard, novel task regimes) and starve it when the harness is already carrying the load. Inference compute and parameter scaling are not independent resources (see "Can inference compute replace scaling up model size?"), and the same is true of evolver and user budget: they trade against each other on the margin.
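One way to make the adaptive ratio concrete is to tie the evolver share to an observed struggle signal. The sketch below is an assumption-laden illustration, not a rule from the source notes: `split_budget` is a hypothetical helper, the recent task failure rate is used as the proxy for "visibly struggling", and the 5% floor and 40% ceiling are arbitrary placeholder values.

```python
from typing import Tuple


def split_budget(
    total_tokens: int,
    failure_rate: float,
    min_evolver_share: float = 0.05,  # placeholder floor, not from the source
    max_evolver_share: float = 0.40,  # placeholder ceiling, not from the source
) -> Tuple[int, int]:
    """Split a fixed token budget into (evolver_tokens, user_tokens).

    The evolver share rises linearly with the observed failure rate, so
    harness evolution is funded when agents struggle and starved otherwise.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    share = min_evolver_share + (max_evolver_share - min_evolver_share) * failure_rate
    evolver_tokens = round(total_tokens * share)
    # Whatever the evolvers do not get goes to the agents using the harness,
    # so the two roles always trade against each other within the same total.
    return evolver_tokens, total_tokens - evolver_tokens
```

A linear ramp is the simplest choice; any monotone mapping from struggle signal to evolver share would express the same idea.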
Finally, if budget keeps growing past the point where refining one configuration helps, the population literature suggests spending it on *many* rather than *bigger*: after single-model pretraining saturates, aggregating a diverse population reaches lower loss than refining one model further (see "Should extra compute refine one model or build many?"), and evolutionary search at inference time, with many candidates kept diverse by an island model, outperforms best-of-N and sequential revision (see "Can evolutionary search beat sampling and revision at inference time?"). The unexpected takeaway: the best use of a marginal token is rarely a stronger evolver. It's a cheaper, more diverse pool of evolvers, an external verifier to keep them honest, and your premium compute reserved for the mid-tier agents who actually metabolize the harness into results.
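To show what "kept diverse by an island model" means mechanically, here is a toy sketch. It is not the Mind Evolution implementation: that work uses LLM-generated mutations and crossovers, whereas this example substitutes random bit flips on a bit string and a trivial fitness function (count of 1 bits). All names and parameters are hypothetical. The part worth noticing is the structure: several sub-populations evolve separately and exchange only their best member at intervals.

```python
import random
from typing import List

Candidate = List[int]  # toy stand-in for a candidate solution or harness edit


def fitness(c: Candidate) -> int:
    return sum(c)  # toy objective: number of 1 bits


def mutate(c: Candidate, rng: random.Random) -> Candidate:
    child = c.copy()
    child[rng.randrange(len(child))] ^= 1  # flip one random bit
    return child


def evolve_islands(
    n_islands: int = 4,
    pop_size: int = 6,
    length: int = 12,
    generations: int = 20,
    migrate_every: int = 5,
    seed: int = 0,
) -> Candidate:
    rng = random.Random(seed)
    islands = [
        [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    for gen in range(1, generations + 1):
        for k, pop in enumerate(islands):
            # Each island evolves on its own: mutate every member, then keep
            # the fittest pop_size of parents plus children (best first).
            children = [mutate(c, rng) for c in pop]
            islands[k] = sorted(pop + children, key=fitness, reverse=True)[:pop_size]
        if gen % migrate_every == 0:
            # Ring migration: each island's best replaces the worst member
            # of the next island, sharing progress without merging pools.
            bests = [isl[0] for isl in islands]
            for k in range(n_islands):
                islands[(k + 1) % n_islands][-1] = bests[k].copy()
    return max((c for isl in islands for c in isl), key=fitness)
```

Infrequent migration is the design choice that matters: islands stay mostly independent, so one early winner cannot take over every sub-population at once.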
Sources (9 notes)
Model capability to produce useful harness edits stays constant across tiers, but capacity to actually benefit from those edits follows an inverted U-shape, peaking in mid-tier models. Weak models fail to invoke harnesses; strong models struggle with faithful instruction-following.
Research shows that for synthetic data generation, models around 500M parameters outperform larger ones in output diversity per sample. Larger models concentrate probability mass on preferred outputs, reducing the variety of distinct samples generated within a fixed budget.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Research shows inference effectiveness varies dramatically by prompt difficulty. Reallocating the same total compute adaptively—giving easy prompts less and hard ones more—substantially outperforms larger models under uniform budgets.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Once single-model pretraining saturates, aggregating predictions from a diverse population of models reaches lower validation loss than further refining one model. Anti-correlated learning-rate and weight-decay schedules plus chain distillation enable this efficiently, matching 256-epoch ensembles with ~56 epochs.
Mind Evolution uses genetic algorithms with LLM-generated mutations and crossovers to significantly outperform Best-of-N and Sequential Revision on planning benchmarks. An island model sustains population diversity, preventing the premature convergence that single-trajectory refinement exhibits.