SYNTHESIS NOTE
Agentic Systems and Tool Use

Are LLM and agent benchmarks really measuring different things?

Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.

Synthesis note · 2026-05-18 · sourced from Flaws

The benchmark literature has bifurcated. There are "LLM benchmarks" (MMLU, GPQA, math, code) that test the model in pure-completion mode, and "agent benchmarks" (SWE-bench, WebArena, OSWorld) that test the model in tool-using, multi-step, environment-coupled mode. The two families are now served by nearly disjoint research communities.

The DELEGATE-52 authors argue this is a category error. The two benchmark families do not measure two different artifacts — they measure two different operating modes of the same artifact. A model is not a "good LLM" or a "good agent" in isolation. It is a model whose behavior is conditioned on whether it is asked to produce one answer or to operate through a tool loop. The same underlying weights respond differently in the two modes, and a model can be strong in one and weak in the other for reasons that are about mode-specific calibration rather than mode-specific intelligence.

The methodological consequence: characterizing a model honestly requires evaluation across modes. Take a model that scores 90 on MMLU and 30 on a long-horizon agent task. It is not "a 90 model with an agent problem to solve". It is a model whose capability is described by two numbers, and the deployment context decides which one matters.
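The two-number view can be made concrete with a small sketch. This is an illustration only, not anything taken from the DELEGATE-52 authors: the names `Mode`, `CapabilityProfile`, `relevant_score`, and `mode_gap` are hypothetical, and the scores are the illustrative 90 and 30 from the paragraph above, assumed to sit on a shared 0-100 scale.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    """Operating modes of one model (hypothetical names for this sketch)."""
    COMPLETION = "completion"  # single-answer benchmarks, e.g. MMLU, GPQA
    AGENTIC = "agentic"        # tool-loop benchmarks, e.g. SWE-bench, WebArena


@dataclass
class CapabilityProfile:
    """Per-mode scores for one model; deliberately has no 'overall' number."""
    model: str
    scores: dict  # maps Mode -> score, assumed to be on a 0-100 scale

    def relevant_score(self, deployment_mode: Mode) -> float:
        # The deployment context decides which number matters.
        if deployment_mode not in self.scores:
            raise KeyError(f"{self.model} was not evaluated in {deployment_mode.value} mode")
        return self.scores[deployment_mode]

    def mode_gap(self) -> float:
        # Positive gap: stronger in completion mode than in agentic mode.
        return self.scores[Mode.COMPLETION] - self.scores[Mode.AGENTIC]


# Illustrative numbers from the text, not measured results.
profile = CapabilityProfile(
    model="example-model",
    scores={Mode.COMPLETION: 90.0, Mode.AGENTIC: 30.0},
)
print(profile.relevant_score(Mode.AGENTIC))
print(profile.mode_gap())
```

Raising an error for a missing mode, rather than falling back to the other score, reflects the note's point that a completion-mode number should not stand in for an agentic one.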

For builders, this argues against treating "agent capability" as a separate research target to be optimized after general capability. The two modes interact. Agentic deployment surfaces failures that completion-mode benchmarks cannot see, and completion-mode strengths do not transfer cleanly to agentic settings.


Original note title

LLM and agent benchmarks are two modes of the same model not separate fields