SYNTHESIS NOTE

Are LLM and agent benchmarks really measuring different things?

Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or do they test two operating modes of the same model? The answer shapes how we evaluate and develop AI systems.

Synthesis note · 2026-05-18 · sourced from Flaws

The benchmark literature has bifurcated. There are "LLM benchmarks" (MMLU, GPQA, math, code) that test the model in pure-completion mode, and "agent benchmarks" (SWE-bench, WebArena, OSWorld) that test the model in tool-using, multi-step, environment-coupled mode. Nearly disjoint research communities have grown up around each family.

The DELEGATE-52 authors argue this is a category error. The two benchmark families do not measure two different artifacts; they measure two different operating modes of the same artifact. A model is not a "good LLM" or a "good agent" in isolation. It is a model whose behavior is conditioned on whether it is asked to produce one answer or to operate through a tool loop. The same underlying weights respond differently in the two modes, and a model can be strong in one and weak in the other because of mode-specific calibration rather than mode-specific intelligence.

The methodological consequence: characterizing a model honestly requires evaluation across modes. A model that scores 90 on MMLU and 30 on a long-horizon agent task is not "a 90 model with an agent problem to solve" — it is a model whose capability has two numbers, and the deployment context decides which one matters.
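
A minimal sketch of what "capability has two numbers" could look like in an evaluation record. Everything here is hypothetical illustration: the names `Mode`, `ModeProfile`, `score_for`, and `mode_gap` are not from the DELEGATE-52 paper, and the scores are just the 90 and 30 used in the example above, not measured results.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    COMPLETION = "completion"  # single-answer benchmarks, e.g. MMLU
    AGENTIC = "agentic"        # tool-loop, multi-step benchmarks


@dataclass
class ModeProfile:
    """Hypothetical record: one score per operating mode, never a blended number."""
    model_name: str
    scores: dict[Mode, float]

    def score_for(self, deployment_mode: Mode) -> float:
        # The deployment context picks which number is the relevant one.
        return self.scores[deployment_mode]

    def mode_gap(self) -> float:
        # Completion-mode score minus agentic-mode score.
        return self.scores[Mode.COMPLETION] - self.scores[Mode.AGENTIC]


profile = ModeProfile(
    model_name="example-model",  # placeholder, not a real model
    scores={Mode.COMPLETION: 90.0, Mode.AGENTIC: 30.0},  # illustrative figures from the text
)

print(profile.score_for(Mode.AGENTIC))  # 30.0
print(profile.mode_gap())               # 60.0
```

The design choice worth noting is that the record has no single "overall" field: a caller has to name a mode before getting a score back, which mirrors the claim that the deployment context decides which number matters.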

For builders, this argues against treating "agent capability" as a separate research target to be optimized after general capability. The two modes interact. Agentic deployment surfaces failures that completion-mode benchmarks cannot see, and completion-mode strengths do not transfer cleanly to agentic settings.

Inquiring lines that read this note 1

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can single-axis benchmarks accurately predict agent deployment success?

Can single benchmarks predict whether an agent will work in the real world?

Related concepts in this collection 3


Do short benchmarks predict how models perform over long workflows? Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
same paper, the relay-length specific case
Do frontier LLMs silently corrupt documents in long workflows? Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.
same paper, the empirical mode-divergence
When do multi-agent systems actually outperform single agents? As individual LLMs grow more capable, does the advantage of splitting work across multiple agents still hold? This explores when coordination overhead makes MAS counterproductive.
adjacent methodology: single vs multi-agent comparison
