Do short benchmarks predict how models perform over long workflows?
Standard LLM benchmarks mostly measure single-turn or short multi-turn performance, but real workflows involve sustained delegation across many turns. This note asks whether top benchmark performers maintain accuracy through longer interaction chains.
Most LLM benchmarks evaluate single-turn or short multi-turn interactions. DELEGATE-52 extends evaluation to 50-round-trip relays and finds that short-interaction performance does not predict how the same model behaves under sustained delegation. Models that perform comparably on a single edit can diverge dramatically by relay 25.
This is a methodological finding, not a model finding. The standard practice (pick the top scorer on benchmark X, deploy it in workflow Y) implicitly assumes that capability is roughly stationary across interaction lengths. The relay results show that this assumption fails. Models exhibit a degradation curve, and that curve has its own shape parameters (slope, decay rate, recovery behavior under interrupted sessions) that benchmarks built for short tasks cannot expose.
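To make "shape parameters" concrete, here is a minimal sketch of how a decay rate could be estimated from per-relay scores. It assumes an exponential form, score(t) ≈ s0 · exp(−k · t), which is an illustrative modeling choice and not the method used in the paper. The function names and the synthetic data are hypothetical, and recovery behavior under interrupted sessions is not modeled.

```python
import math


def fit_exponential_decay(scores):
    """Fit score(t) ~= s0 * exp(-k * t) by least squares on log(score).

    scores[i] is the score measured after relay i + 1; every score must be > 0.
    Returns (s0, k), where k > 0 means the score decays with relay count.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two relay scores to fit a curve")
    ts = list(range(1, n + 1))
    ys = [math.log(s) for s in scores]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    # Ordinary least-squares slope and intercept of log(score) against t.
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return math.exp(intercept), -slope


def half_life(k):
    """Number of relays after which the fitted score halves (inf if no decay)."""
    return math.log(2) / k if k > 0 else math.inf


if __name__ == "__main__":
    # Synthetic, made-up scores for illustration only (not benchmark results).
    synthetic = [0.9 * math.exp(-0.02 * t) for t in range(1, 51)]
    s0, k = fit_exponential_decay(synthetic)
    print(f"s0={s0:.3f}, k={k:.4f}, half-life={half_life(k):.1f} relays")
```

Two models with the same fitted s0 can have very different k, which is exactly the information a single-turn score cannot carry.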
The implication is that "long-horizon performance" deserves status as a distinct evaluation axis, not as a property to be inferred from single-step competence. A model with strong relay-50 retention but mediocre single-turn polish may be more useful for delegated work than the inverse. The paper argues this directly: capability research has been investing heavily in memory management while leaving the underlying long-interaction degradation profile under-measured.
For practitioners, this changes the deployment question from "which model scores highest on X" to "which model maintains accuracy through the interaction length my workflow requires." For benchmark designers, it argues for relay-style evaluations as a default rather than an add-on.
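The changed deployment question can be sketched as a ranking by score at the relay depth a workflow actually needs, instead of by score at relay 1. The model names and curves below are invented for illustration and are not results from DELEGATE-52; `score_at_horizon` and `rank_models` are hypothetical helpers.

```python
def score_at_horizon(curve, horizon):
    """Return the measured score at relay `horizon` (1-indexed)."""
    if horizon < 1 or horizon > len(curve):
        raise ValueError(f"horizon {horizon} is outside the measured range")
    return curve[horizon - 1]


def rank_models(curves, horizon):
    """Order model names by their score at the required horizon, best first."""
    return sorted(
        curves,
        key=lambda name: score_at_horizon(curves[name], horizon),
        reverse=True,
    )


# Made-up per-relay scores: model_a starts higher but degrades faster.
curves = {
    "model_a": [0.95 - 0.010 * i for i in range(50)],
    "model_b": [0.90 - 0.002 * i for i in range(50)],
}

if __name__ == "__main__":
    for horizon in (1, 25, 50):
        print(horizon, rank_models(curves, horizon))
```

With these invented curves the ranking at relay 1 is reversed by relay 25, which is the situation in which picking the top single-turn scorer would select the wrong model for a long workflow.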
Inquiring lines that use this note as a source (15)
This note is a source for these synthesized inquiries.
- Can an LLM be well calibrated but still unreliable on single evaluations?
- What specific metrics distinguish single-turn versus multi-turn collaboration success?
- Why do backward-looking benchmarks underestimate LLM scientific value?
- What deployment context determines which benchmark mode actually matters?
- What separates good workflow design from poor workflow design?
- How should benchmarks evaluate workflow architecture versus raw model performance?
- What is the gap between benchmark performance and real workplace task completion?
- Why do short interaction benchmarks fail to predict long horizon performance?
- Do reasoning benchmarks predict real performance in long delegated workflows?
- Should long horizon performance be measured as a separate evaluation axis?
- Which model capabilities actually matter for sustained workflow delegation?
- How should we measure and report serial compute separately?
- Which workflow positions concentrate the most downstream dependencies and influence?
- Can a single axis benchmark ever represent deployment readiness accurately?
- What real-world tasks most clearly expose gaps between benchmark performance and actual capability?
Related concepts in this collection (3)
- Do frontier LLMs silently corrupt documents in long workflows?
  Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.
  same paper, the underlying phenomenon
- Are LLM and agent benchmarks really measuring different things?
  Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.
  same paper, complementary methodology implication
- Do models fail worse when their own errors fill the context?
  As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
  adjacent: known long-horizon failure mode
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LLMs Corrupt Your Documents When You Delegate
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- LLMs Get Lost In Multi-Turn Conversation
- UserBench: An Interactive Gym Environment for User-Centric Agents
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
- Towards a Science of Scaling Agent Systems
Original note title
short-interaction LLM benchmarks do not predict long-horizon delegated-workflow performance