SYNTHESIS NOTE

What predicts success in ultra-long-horizon agent tasks?

Does an agent's initial solution quality matter more than its willingness to iterate? AUTOLAB's frontier-model benchmark suggests persistence through feedback loops may be the true differentiator.

Synthesis note · 2026-06-27 · sourced from Evaluations

AUTOLAB reframes what a long-horizon agent benchmark should test. Most agentic evals score either single-turn responses or short interactive trajectories; AUTOLAB instead hands the agent a correct but deliberately suboptimal baseline across 36 expert-curated tasks (system optimization, CUDA kernels, model development, puzzles) and asks it to improve the artifact within a strict wall-clock budget. The striking empirical result, across 17 frontier models, is that the dominant predictor of success is not the quality of the agent's initial attempt but its persistence — its willingness to repeatedly benchmark, edit, and incorporate noisy empirical feedback over many cycles. Most models, including proprietary ones, either terminate prematurely or exhaust their budget with minimal progress; claude-opus-4.6 is called out as a strong exception.

This is a sharper, more operational claim than "agents should iterate." It says the binding constraint is a behavioral disposition toward sustained empirical grounding, and that disposition is unevenly distributed across models that look comparable on one-shot benchmarks. It grounds Should agent evaluation measure more than task success? with a concrete trajectory-level predictor, and it sits naturally alongside Does raw token spending actually predict agent performance? — persistence only pays if each loop returns informative, retained feedback, otherwise it is budget-burning churn, not progress.

The mechanism cuts against itself, however. Do models fail worse when their own errors fill the context? implies that more loops mean more accumulated mistakes in context, which should degrade the very iteration AUTOLAB rewards. The reconciliation is probably that persistence pays only when paired with calibrated scoring that lets the agent see whether an edit actually helped — pure persistence without trustworthy feedback would amplify error. That is why the authors single out harness design as the promising lever: the harness, not the backbone alone, decides whether long horizons compound feedback or compound noise.

Inquiring lines that use this note as a source 10

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
13 direct connections · 125 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

on ultra-long-horizon optimization the predictor of agent success is persistence in the feedback loop not the quality of the first attempt