Agents' Last Exam

Paper · arXiv 2606.05405 · Published June 3, 2026
LLM Evaluations and Benchmarks

Abstract Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks do not measure sustained performance on real, economically valuable workflows. This paper introduces Agents’ Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy of 55 subfields grouped into 13 industry clusters, covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is below 1%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.

Introduction. Over the past few years, AI systems have cleared one celebrated benchmark after another: world-champion games [43], olympiad mathematics [18], and competitive programming [16]. Yet by the metric that ultimately matters, economic output, the broader impact has remained surprisingly muted; benchmark victories have accumulated faster than measurable transformation in core industries. This gap, which we view as a utility problem for AI, suggests that the field now needs evaluations that measure not only abstract competence, but also the ability to carry out long-horizon, economically valuable work in real professional environments. The gap matters because AI progress is strongly shaped by the benchmarks the field chooses to optimize. Benchmarks do not merely record capability; they focus research attention, define engineering targets, and often determine which domains become tractable for rapid improvement.

Discussion / Conclusion. We introduced ALE, a benchmark of 960 expert-authored task workflows (1,490 task instances) across 55 digital industries, sourced from work that experts have already shipped, anchored in the SOC/O*NET taxonomy, and scored through deterministic checks and structured rubrics rather than open-ended LLM judging. Frontier agents clear only a small fraction of these tasks today; we release ALE as an instrument for closing the gap between benchmark success and GDP-relevant impact, where saturation would signal that agents can sustain the long-horizon, tool-intensive work that professional practice actually requires.
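
To make the headline metric concrete, the following minimal sketch shows one way an "average full pass rate" could be computed from deterministic checks and rubric items. It is an illustration under stated assumptions, not ALE's actual scoring code: the excerpt above does not define "full pass", so the sketch assumes a task instance passes fully only when every deterministic check and every rubric item is satisfied, and that the average is taken over per-configuration pass rates. The names `InstanceResult`, `full_pass`, `full_pass_rate_by_config`, and `average_full_pass_rate` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class InstanceResult:
    """Outcome of one task instance under one configuration (hypothetical schema)."""
    config: str                # assumed label for a harness + backbone pairing
    checks: tuple[bool, ...]   # outcomes of deterministic checks
    rubric: tuple[bool, ...]   # outcomes of structured rubric items


def full_pass(result: InstanceResult) -> bool:
    # Assumption: a "full pass" requires every check and every rubric item to hold.
    return all(result.checks) and all(result.rubric)


def full_pass_rate_by_config(results: list[InstanceResult]) -> dict[str, float]:
    # Group full-pass outcomes by configuration, then take the fraction passed.
    outcomes: dict[str, list[bool]] = {}
    for r in results:
        outcomes.setdefault(r.config, []).append(full_pass(r))
    return {cfg: sum(v) / len(v) for cfg, v in outcomes.items()}


def average_full_pass_rate(results: list[InstanceResult]) -> float:
    # Assumption: unweighted mean over configurations, not over instances.
    return mean(list(full_pass_rate_by_config(results).values()))


if __name__ == "__main__":
    # Made-up outcomes for illustration only; these are not ALE results.
    demo = [
        InstanceResult("config-a", (True, True), (True,)),
        InstanceResult("config-a", (True, False), (True,)),
        InstanceResult("config-b", (True, True), (False,)),
        InstanceResult("config-b", (False,), (True, True)),
    ]
    print(full_pass_rate_by_config(demo))
    print(average_full_pass_rate(demo))
```

Under this all-or-nothing assumption, partial progress on a long workflow earns no credit, which is one reason a full pass rate can stay very low even when agents complete many individual steps.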