SYNTHESIS NOTE

What predicts success in ultra-long-horizon agent tasks?

Does an agent's initial solution quality matter more than its willingness to iterate? AUTOLAB's frontier-model benchmark suggests persistence through feedback loops may be the true differentiator.

Synthesis note · 2026-06-27 · sourced from Evaluations

AUTOLAB reframes what a long-horizon agent benchmark should test. Most agentic evals score either single-turn responses or short interactive trajectories; AUTOLAB instead hands the agent a correct but deliberately suboptimal baseline across 36 expert-curated tasks (system optimization, CUDA kernels, model development, puzzles) and asks it to improve the artifact within a strict wall-clock budget. The striking empirical result, across 17 frontier models, is that the dominant predictor of success is not the quality of the agent's initial attempt but its persistence — its willingness to repeatedly benchmark, edit, and incorporate noisy empirical feedback over many cycles. Most models, including proprietary ones, either terminate prematurely or exhaust their budget with minimal progress; claude-opus-4.6 is called out as a strong exception.

This is a sharper, more operational claim than "agents should iterate." It says the binding constraint is a behavioral disposition toward sustained empirical grounding, and that disposition is unevenly distributed across models that look comparable on one-shot benchmarks. It grounds Should agent evaluation measure more than task success? with a concrete trajectory-level predictor, and it sits naturally alongside Does raw token spending actually predict agent performance? — persistence only pays if each loop returns informative, retained feedback, otherwise it is budget-burning churn, not progress.

The mechanism cuts against itself, however. Do models fail worse when their own errors fill the context? implies that more loops mean more accumulated mistakes in context, which should degrade the very iteration AUTOLAB rewards. The reconciliation is probably that persistence pays only when paired with calibrated scoring that lets the agent see whether an edit actually helped — pure persistence without trustworthy feedback would amplify error. That is why the authors single out harness design as the promising lever: the harness, not the backbone alone, decides whether long horizons compound feedback or compound noise.

Inquiring lines that use this note as a source 10

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

13 direct connections · 125 in 2-hop network ·dense cluster Open in graph ↗

What predicts success in ultra-long-horizon agen… Should agent evaluation measure more than task suc… Does raw token spending actually predict agent per… Do models fail worse when their own errors fill th…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Should agent evaluation measure more than task success? Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?
grounds: supplies persistence/feedback-incorporation as a concrete trajectory-quality predictor
Does raw token spending actually predict agent performance? Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.
extends: persistence converts to progress only when each loop yields informative, retained feedback
Do models fail worse when their own errors fill the context? As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
contradicts/qualifies: more iteration accumulates errors in context, so persistence helps only with trustworthy scoring

What predicts success in ultra-long-horizon agent tasks?

Related concepts in this collection 3

Related papers in this collection 8

Search by related questions 4