Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents

Paper · arXiv 2605.30621 · Published May 28, 2026
Evolutionary Methods

LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories, and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. Yet it remains unclear whether a model’s base capability in task solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; and (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B’s updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier models.
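One way to picture the two capabilities is as the two margins of an evolver-by-agent gain table. The sketch below is only an illustration under stated assumptions, not the paper's evaluation protocol: it assumes a gain is the task score with an updated harness minus the score with the original harness, and that each capability can be summarized by a simple mean. The names `harness_gain`, `updating_score`, and `benefit_score`, the model labels, and every number are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def harness_gain(score_with_update: float, score_baseline: float) -> float:
    """Assumed gain definition: score with updated harness minus baseline score."""
    return score_with_update - score_baseline


def updating_score(gains: dict, evolver: str) -> float:
    """Harness-updating proxy: mean gain an evolver's updates induce across agents."""
    return mean(g for (e, _agent), g in gains.items() if e == evolver)


def benefit_score(gains: dict, agent: str) -> float:
    """Harness-benefit proxy: mean gain an agent obtains across evolvers."""
    return mean(g for (_evolver, a), g in gains.items() if a == agent)


# Hypothetical placeholder scores, invented for illustration only.
baseline = {"agent_weak": 0.20, "agent_mid": 0.50, "agent_strong": 0.80}
evolved = {
    ("evolver_small", "agent_weak"): 0.22,
    ("evolver_small", "agent_mid"): 0.62,
    ("evolver_small", "agent_strong"): 0.85,
    ("evolver_large", "agent_weak"): 0.23,
    ("evolver_large", "agent_mid"): 0.63,
    ("evolver_large", "agent_strong"): 0.86,
}

# Gain for every (evolver, agent) pair, measured against that agent's own baseline.
gains = {
    (evolver, agent): harness_gain(score, baseline[agent])
    for (evolver, agent), score in evolved.items()
}

for evolver in ("evolver_small", "evolver_large"):
    print(f"updating[{evolver}] = {updating_score(gains, evolver):.3f}")
for agent in baseline:
    print(f"benefit[{agent}] = {benefit_score(gains, agent):.3f}")
```

Keeping the two summaries as separate margins of the same table makes it possible to vary the evolver while holding the agent fixed, and vice versa, which is the kind of separation the two definitions call for.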

Introduction. Large language models (LLMs) (Radford et al., 2018; Touvron et al., 2023) have become a general-purpose foundation for language understanding (Hendrycks et al., 2020), reasoning (Wang et al., 2025), and task solving (Zhou et al., 2025). Increasingly, they also power agentic systems that interact with external environments, call tools, operate software interfaces, and complete long-horizon tasks (Yang et al., 2024b; Merrill et al., 2026). In these settings, system behavior depends not only on the underlying model but also on an external agent harness: prompts (Wei et al., 2022), skills (Xia et al., 2026), memories (Yan et al., 2025), tools (Qin et al., 2024), etc., that shape how the model observes, reasons, acts, and recovers from errors. Improving an agentic system increasingly means refining not only the foundation model, but also the editable harness around it. In current practice, harnesses are typically designed by hand.

Discussion / Conclusion. We analyze harness self-evolution by decomposing it into two model capabilities distinct from base capability: harness-updating, the capability to produce useful harness updates from execution evidence, and harness-benefit, the capability to benefit from updated harnesses during task solving. Across seven LLMs and three benchmarks, harness-updating is flat in base capability: models across capability tiers produce updates that yield similar gains, and even the Qwen3.5-9B evolver induces gains comparable to those of Claude Opus 4.6. In contrast, harness-benefit is non-monotonic in base capability: weak-tier models gain little, mid-tier models gain most, and strong-tier models gain less than mid-tier models. The weak-tier shortfall is traced to two failure modes: failing to activate relevant harness artifacts and failing to follow them faithfully once activated. These findings motivate investing capability budget in the agent rather than the evolver, and targeting agent training at harness invocation and long-horizon instruction following.