Do short benchmarks predict how models perform over long workflows?

Lines of inquiry that draw on this note 17

This note is a source for the research framings below, grouped by the broader line of inquiry each explores. Each line of inquiry is followed by the specific questions that pursue it.

Can model confidence signals reliably improve reasoning quality and calibration?

Can an LLM be well calibrated but still unreliable on single evaluations?

Can single-axis benchmarks accurately predict agent deployment success?

What specific metrics distinguish single-turn versus multi-turn collaboration success?
What deployment context determines which benchmark mode actually matters?
How should benchmarks evaluate workflow architecture versus raw model performance?
Why do short interaction benchmarks fail to predict long horizon performance?
Should long horizon performance be measured as a separate evaluation axis?
Can a single axis benchmark ever represent deployment readiness accurately?
What real-world tasks most clearly expose gaps between benchmark performance and actual capability?
Can automated benchmarks accurately capture progress on real-world long-horizon tasks?
How should single-axis benchmarks account for separable capability dimensions?

How do evaluation biases undermine LLM quality assessment systems?

Why do backward-looking benchmarks underestimate LLM scientific value?

What drives capability and cost efficiency in agent systems?

What separates good workflow design from poor workflow design?

Why do benchmark improvements fail to reflect actual reasoning quality?

What causes silent corruption to amplify through delegated workflows?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

How should we measure and report serial compute separately?

Related concepts in this collection 3

Do frontier LLMs silently corrupt documents in long workflows?
Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.
same paper, the underlying phenomenon

Are LLM and agent benchmarks really measuring different things?
Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.
same paper, complementary methodology implication

Do models fail worse when their own errors fill the context?
As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
adjacent: known long-horizon failure mode

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

LLMs Corrupt Your Documents When You Delegate (0.82 match)
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (0.80 match)
LLMs Get Lost In Multi-Turn Conversation (0.78 match)
UserBench: An Interactive Gym Environment for User-Centric Agents (0.77 match)
Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models (0.77 match)
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference (0.77 match)
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? (0.77 match)
Towards a Science of Scaling Agent Systems (0.76 match)
