AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

Paper · arXiv 2606.05080 · Published June 3, 2026
LLM Evaluations and Benchmarks

Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AUTOLAB, a new benchmark for ultra long-horizon closed-loop optimization. AUTOLAB consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals the dominant predictor of success is not the quality of an agent’s initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress.

Introduction. Frontier LLM agents are increasingly deployed on tasks that play out over hours rather than minutes, from post-training models (Rank et al., 2026) and optimizing low-level systems (Chi et al., 2026) to running open-ended research loops (Novikov et al., 2025; Karpathy, 2026). Progress on such tasks is iterative: it comes from inspecting an artifact, proposing a change, running experiments, measuring the outcome, and refining over many cycles, not from a single correct answer. Sustaining this loop over a long horizon requires managing time, compute, and noisy empirical signals. Short, single-shot evaluations are not designed to test whether today’s frontier models can do so. Current evaluations largely overlook this regime. Static, single-turn coding benchmarks primarily test model knowledge and one-shot coding (Jain et al., 2025a; Zhuo et al., 2025). Another wave of agentic benchmarks has extended to short, interactive trajectories (Mialon et al., 2023; Liu et al., 2024; Jimenez et al., 2024; Merrill et al., 2026).
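As a rough illustration of the propose, run, measure, refine loop described above, the following Python sketch runs a toy closed-loop optimization under a wall-clock budget. It is not AUTOLAB's harness or any agent's actual method: the objective in `measure`, the function names, the step size, and the `max_iters` cap are all hypothetical choices made for the example. The only ideas taken from the text are that the loop starts from a suboptimal baseline, relies on noisy empirical measurements, and stops when the time budget runs out.

```python
import random
import statistics
import time


def measure(candidate, rng, repeats=5):
    """Hypothetical noisy benchmark (lower is better).

    The true cost is a toy quadratic with its minimum at 3.0; Gaussian noise
    stands in for run-to-run variance. Taking the median of several repeats
    damps that noise before a decision is made.
    """
    true_cost = (candidate - 3.0) ** 2 + 1.0
    samples = [true_cost + rng.gauss(0.0, 0.05) for _ in range(repeats)]
    return statistics.median(samples)


def optimize(baseline, budget_s, seed=0, max_iters=10_000):
    """Iteratively improve `baseline` until the wall-clock budget expires.

    `max_iters` is an assumption added so the example stays bounded and
    reproducible; a real harness would rely on the time budget alone.
    """
    rng = random.Random(seed)
    deadline = time.monotonic() + budget_s

    best = baseline
    best_cost = measure(best, rng)
    history = [best_cost]  # measured cost after each accepted change

    iters = 0
    while time.monotonic() < deadline and iters < max_iters:
        iters += 1
        proposal = best + rng.gauss(0.0, 0.5)  # propose a change
        cost = measure(proposal, rng)          # run the experiment
        if cost < best_cost:                   # keep only measured improvements
            best, best_cost = proposal, cost
            history.append(cost)

    return best, best_cost, history


if __name__ == "__main__":
    best, best_cost, history = optimize(baseline=10.0, budget_s=5.0, max_iters=200)
    print(f"accepted {len(history) - 1} changes, final measured cost {best_cost:.3f}")
```

Because acceptance is based on a noisy measurement, a change can occasionally be kept on the strength of a lucky sample; raising `repeats` trades budget for more reliable decisions, which is one concrete form of the time-versus-signal trade-off the paragraph mentions.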

Discussion / Conclusion. Taken together, these findings suggest that harness design itself is a promising direction for future research: carefully tuned harnesses, by offering more iteration headroom for smaller models and tighter, high-quality

We introduced AUTOLAB, a benchmark for evaluating frontier models on ultra long-horizon research and engineering tasks that require sustained iteration over hours rather than minutes. By enforcing ultra long-horizon tasks, continuous calibrated scoring, and strong anti-hacking safeguards, AUTOLAB reveals that