What predicts success in ultra-long-horizon agent tasks?
Does an agent's initial solution quality matter more than its willingness to iterate? AUTOLAB's frontier-model benchmark suggests persistence through feedback loops may be the true differentiator.
AUTOLAB reframes what a long-horizon agent benchmark should test. Most agentic evals score either single-turn responses or short interactive trajectories; AUTOLAB instead hands the agent a correct but deliberately suboptimal baseline across 36 expert-curated tasks (system optimization, CUDA kernels, model development, puzzles) and asks it to improve the artifact within a strict wall-clock budget. The striking empirical result, across 17 frontier models, is that the dominant predictor of success is not the quality of the agent's initial attempt but its persistence — its willingness to repeatedly benchmark, edit, and incorporate noisy empirical feedback over many cycles. Most models, including proprietary ones, either terminate prematurely or exhaust their budget with minimal progress; claude-opus-4.6 is called out as a strong exception.
This is a sharper, more operational claim than "agents should iterate." It says the binding constraint is a behavioral disposition toward sustained empirical grounding, and that disposition is unevenly distributed across models that look comparable on one-shot benchmarks. It grounds Should agent evaluation measure more than task success? with a concrete trajectory-level predictor, and it sits naturally alongside Does raw token spending actually predict agent performance? — persistence only pays if each loop returns informative, retained feedback, otherwise it is budget-burning churn, not progress.
The mechanism cuts against itself, however. Do models fail worse when their own errors fill the context? implies that more loops mean more accumulated mistakes in context, which should degrade the very iteration AUTOLAB rewards. The reconciliation is probably that persistence pays only when paired with calibrated scoring that lets the agent see whether an edit actually helped — pure persistence without trustworthy feedback would amplify error. That is why the authors single out harness design as the promising lever: the harness, not the backbone alone, decides whether long horizons compound feedback or compound noise.
Inquiring lines that use this note as a source 10
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do trajectory quality metrics predict agent safety and user trust?
- Why has agent research prioritized policy over world model development?
- What makes skills worth externalizing into a persistent harness?
- Can single-axis benchmarks measure across all three agent capability layers?
- Can agent-authored skill libraries compound autonomy gains over time?
- How do agents decide which created code deserves long-term persistence?
- Why do most frontier models terminate early on long-horizon benchmarks?
- Should evaluations shift toward open-world messy tasks instead of contests?
- Why do persistent AI systems require fundamentally different design than ad-hoc supporters?
- Why do AI agents struggle with novel experiments but excel at routine tasks?
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Should agent evaluation measure more than task success?
Current benchmarks reduce agents to a single success score, but agents emerge from multiple interacting systems. What dimensions of agent behavior should builders actually measure to predict deployment readiness?
grounds: supplies persistence/feedback-incorporation as a concrete trajectory-quality predictor
-
Does raw token spending actually predict agent performance?
Standard measures of agent effort—tokens, tool calls, operations—may not capture what makes inference-time scaling work. This explores what actually drives performance gains when agents spend more compute.
extends: persistence converts to progress only when each loop yields informative, retained feedback
-
Do models fail worse when their own errors fill the context?
As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
contradicts/qualifies: more iteration accumulates errors in context, so persistence helps only with trustworthy scoring
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- Artifacts as Memory Beyond the Agent Boundary
- Open-World Evaluations for Measuring Frontier AI Capabilities
- Agents' Last Exam
- What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
Original note title
on ultra-long-horizon optimization the predictor of agent success is persistence in the feedback loop not the quality of the first attempt