Why do most frontier models terminate early on long-horizon benchmarks?
This explores why frontier models give up or stop short on tasks that require sustained, multi-step effort over long horizons — and what the corpus says actually separates the models that keep going from the ones that quit.
The corpus's sharpest answer is that the deciding factor is not raw intelligence but persistence. Across 17 frontier models on 36 expert-curated optimization tasks, the dominant predictor of success was simply staying in the benchmark-edit-incorporate loop within the wall-clock budget; most models either terminated early or burned their budget unproductively, while Claude Opus 4.6 stood out for sticking with it (What predicts success in ultra-long-horizon agent tasks?). So 'why terminate early' partly reframes to: what keeps a model in the loop, and what makes the loop worth staying in?
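As a rough illustration of the benchmark-edit-incorporate loop under a wall-clock budget, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`optimize_within_budget`, `propose_edit`, `benchmark`, `make_fake_clock`), the toy scoring function, and the rule of keeping only strict improvements. It is not the harness used in the study.

```python
import time
from typing import Callable, Tuple


def optimize_within_budget(
    initial: float,
    propose_edit: Callable[[float], float],   # hypothetical "edit" step
    benchmark: Callable[[float], float],      # hypothetical scorer, higher is better
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, int]:
    """Repeat edit -> benchmark -> incorporate until the budget is spent.

    Returns (best_candidate, best_score, completed_cycles).
    """
    deadline = clock() + budget_seconds
    best, best_score = initial, benchmark(initial)
    cycles = 0
    while clock() < deadline:
        candidate = propose_edit(best)    # edit the current best
        score = benchmark(candidate)      # measure it
        if score > best_score:            # incorporate only strict improvements
            best, best_score = candidate, score
        cycles += 1
    return best, best_score, cycles


def make_fake_clock() -> Callable[[], int]:
    """Deterministic clock for demos: returns 0, 1, 2, ... on successive calls."""
    state = {"now": -1}

    def clock() -> int:
        state["now"] += 1
        return state["now"]

    return clock


# Toy problem (assumed): the score peaks at x == 3 and each edit adds 1.
result = optimize_within_budget(
    initial=0,
    propose_edit=lambda x: x + 1,
    benchmark=lambda x: -(x - 3) ** 2,
    budget_seconds=5,
    clock=make_fake_clock(),
)
```

In this framing, a model that terminates early exits the `while` loop with budget left over, and a model that burns its budget keeps cycling without ever taking the `score > best_score` branch.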
One deep reason is that the loop itself becomes unreliable without an external anchor. Pure self-improvement stalls because of the generation-verification gap, diversity collapse, and reward hacking: a model can keep generating but cannot reliably tell whether its next step is better, so further iterations add little (Can models reliably improve themselves without external feedback?). The methods that do sustain progress smuggle in external signals: tests, proofs, type checks, judges, or tool feedback. A committee of weak model calls only matches strong models when a local soundness check converts good-but-unselected proposals into chosen ones (When can weak models match strong model performance?). Without that verification scaffold, a long horizon is just a long walk in the dark, and early termination is the rational response.
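The difference between sampling and selecting can be sketched in a few lines of Python. This is a toy under stated assumptions: the proposals are hard-coded stand-ins for weak model calls, the task (factoring 91) and the names `majority_vote`, `select_with_check`, and `factors_91` are invented for illustration, and the soundness check is a simple multiplication.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

Factors = Tuple[int, int]


def majority_vote(proposals: Sequence[Factors]) -> Factors:
    """Selection without verification: return the most frequent proposal."""
    return Counter(proposals).most_common(1)[0][0]


def select_with_check(
    proposals: Sequence[Factors],
    is_sound: Callable[[Factors], bool],
) -> Optional[Factors]:
    """Selection with a local soundness check: return the first proposal that passes."""
    return next((p for p in proposals if is_sound(p)), None)


def factors_91(p: Factors) -> bool:
    """Cheap external check (assumed task): a nontrivial factorization of 91."""
    a, b = p
    return a > 1 and b > 1 and a * b == 91


# Hypothetical committee output: the correct answer is present but outvoted.
proposals = [(3, 31), (3, 31), (7, 13), (9, 10)]

voted = majority_vote(proposals)                     # picks the popular wrong answer
checked = select_with_check(proposals, factors_91)   # picks the verified answer
```

Here the correct pair is already in the sample; only the check turns it from an unselected proposal into the chosen one. If no proposal passes, `select_with_check` returns `None`, which is itself a useful signal to keep sampling.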
There's also an architectural floor. Autoregressive generation can't retract a token it has already emitted, while many long tasks (constraint satisfaction especially) depend on discarding invalid partial work and backtracking. The model hits a wall not because it ran out of ideas but because it has no primitive for undoing a bad commitment (Why does autoregressive generation fail at constraint satisfaction?). Relatedly, the more instructions or constraints a task piles on, the more performance degrades in predictable patterns; reasoning models hold steady up to around 150 instructions and then fail steeply (How does instruction density affect model performance?). Long horizons accumulate exactly this kind of density, so 'terminate early' can be the visible symptom of a model quietly losing the thread.
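To make the missing "undo" primitive concrete, the Python sketch below contrasts two solvers on N-queens. It is an analogy, not a model of a transformer: `solve_commit_only` is an assumed stand-in for a process that can never retract a choice, and `solve_backtracking` is a textbook depth-first search that discards invalid partial assignments. All names are illustrative.

```python
from typing import List, Optional


def safe(cols: List[int], col: int) -> bool:
    """True if a queen in the next row at `col` attacks none of the placed queens."""
    row = len(cols)
    return all(c != col and abs(c - col) != row - r for r, c in enumerate(cols))


def solve_commit_only(n: int) -> Optional[List[int]]:
    """Take the first locally valid column in each row and never undo it."""
    cols: List[int] = []
    for _ in range(n):
        choice = next((c for c in range(n) if safe(cols, c)), None)
        if choice is None:
            return None          # dead end: earlier commitments cannot be retracted
        cols.append(choice)
    return cols


def solve_backtracking(n: int, cols: Optional[List[int]] = None) -> Optional[List[int]]:
    """Depth-first search that retracts a placement when it leads to a dead end."""
    cols = [] if cols is None else cols
    if len(cols) == n:
        return list(cols)
    for col in range(n):
        if safe(cols, col):
            cols.append(col)                      # tentative commitment
            result = solve_backtracking(n, cols)
            if result is not None:
                return result
            cols.pop()                            # retract the bad commitment
    return None


stuck = solve_commit_only(4)       # first choice (column 0) leads to a dead end
solved = solve_backtracking(4)     # same first choice, but it gets undone
```

On the 4x4 board both solvers start with the same first move; the only difference is the `cols.pop()` line, which is the step a symbolic solver supplies.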
The most uncomfortable possibility is that the benchmarks themselves manufacture the failure. Automated benchmarks privilege precisely-specified, auto-gradable tasks, which both overstates and understates real capability; open-world evaluation of messy long-horizon work, read through qualitative logs with cost reported, tells a different story (Do automated benchmarks hide what frontier AI systems can really do?). The benchmark-to-GDP gap makes this concrete: agents clear abstract contests but fail real occupational workflows, because the field optimizes what it measures and has been measuring contests rather than work (Why do agent benchmarks not predict real economic value?). Early termination on a long-horizon benchmark may say as much about how the horizon was constructed as about the model walking it.
A less expected finding: the fixes here are less about bigger models and more about restructuring the loop. Returns from better memory architecture now exceed returns from adding parameters (Has memory architecture replaced parameter count as the scaling frontier?), routing the right query to the right specialist beats scaling a single model (Can routing beat building one better model?), and iterative latent depth lets a model spend more computation on harder steps instead of more parameters everywhere (Can looped computation replace parameter count in world models?). Persistence, in other words, can be engineered into the harness; it doesn't have to be hoped for from the model.
Sources (10 notes)
Across 17 frontier models on 36 expert-curated optimization tasks, repeated benchmark-edit-incorporate cycles within a wall-clock budget proved the dominant success predictor. Most models terminated early or burned budget unproductively; Claude Opus 4.6 stood out as persistent.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.
Sampling alone amplifies coverage but cannot select correct solutions. Reliable performance matching requires external soundness signals—tests, proofs, or type checks—that convert latent correct proposals into actual selections.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
The IFScale benchmark shows three degradation patterns: linear (small models), exponential (mid-range), and threshold decay (reasoning models hold steady up to ~150 instructions, then fail steeply). Even the best models reach only 68% accuracy at maximum density.
Automated benchmarks both overstate and understate capability by privileging precisely-specified, auto-gradable tasks. Open-world evaluations of long-horizon messy tasks through qualitative log analysis—with cost explicitly reported—correct these distortions and catch emerging capabilities earlier.
ALE's analysis of 960 real occupational workflows shows agents excel at abstract contests but fail long-horizon professional tasks. The gap is not model capability but benchmark design—the field optimizes what it measures, and it has measured contests rather than work.
Three converging signals in late-2025 research—taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws—show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.