How should benchmarks evaluate workflow architecture versus raw model performance?
This explores whether benchmarks should measure how a system is assembled (its planning/execution structure, tool wiring, and how it holds up over long tasks) rather than just ranking the raw model. The corpus suggests the conventional single-score, single-turn benchmark is quietly misleading, because what determines real-world performance is often not the model but the architecture around it.
The sharpest evidence is that short benchmarks do not predict long workflows. Models that rank similarly on single-turn tasks diverge dramatically once they are chained across many hand-offs, with degradation curves that standard tests never reveal (Do short benchmarks predict how models perform over long workflows?). And the degradation isn't loud: frontier models silently corrupt roughly a quarter of document content over extended relay tasks, with errors compounding rather than plateauing (Do frontier LLMs silently corrupt documents in long workflows?). A benchmark that stops at turn one would call these models excellent. So the first thing a workflow-aware benchmark has to do is measure over time, not over a snapshot.
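A minimal sketch of what "measure over time" could look like, assuming a relay setup in which a document is handed off repeatedly and scored against the original after every round. The names `relay_curve`, `fidelity`, and `lossy_hand_off` are hypothetical, the hand-off is a stub rather than a real model call, and character-level similarity is only one possible fidelity measure; none of this is taken from the cited evaluations.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def fidelity(original: str, current: str) -> float:
    """Similarity in [0, 1] between the original document and the current copy."""
    return SequenceMatcher(None, original, current).ratio()


def relay_curve(document: str, hand_off: Callable[[str], str], rounds: int) -> List[float]:
    """Pass the document through `hand_off` repeatedly, scoring after every round."""
    curve = []
    current = document
    for _ in range(rounds):
        current = hand_off(current)
        curve.append(fidelity(document, current))
    return curve


def lossy_hand_off(text: str) -> str:
    """Hypothetical stand-in for a model call: silently drops the last word."""
    words = text.split()
    return " ".join(words[:-1]) if len(words) > 1 else text


doc = " ".join(f"w{i}" for i in range(40))
curve = relay_curve(doc, lossy_hand_off, rounds=10)
print(f"round 1: {curve[0]:.3f}, round 10: {curve[-1]:.3f}")
```

Reporting the whole curve rather than only the first point is the design choice that matters here: a single-turn benchmark corresponds to reading `curve[0]` and stopping.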
Several notes converge on the idea that architecture frequently dominates raw capability, which means a benchmark that holds architecture fixed is measuring the wrong variable. LLMs turn out to forecast better than recognized, but only when the workflow separates numerical from contextual reasoning; monolithic prompting hides the ability entirely (Can LLMs actually forecast time series better than we think?). Splitting a decomposer from a solver beats a single monolithic model, and notably the decomposition skill generalizes across domains while solving doesn't (Does separating planning from execution improve reasoning accuracy?). In recommenders, constraint design and inductive bias beat sheer model depth (What architectural choices actually improve recommender system performance?). The lesson for evaluation: the same model can land at very different scores depending on how it's wired, so a fair benchmark must treat the workflow as the unit under test, not as a constant.
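One way to make the workflow the unit under test is to score every (model, workflow) pair rather than every model. The sketch below assumes exact-match grading and uses a deliberately contrived stub, `toy_model`, that can only add once the task has been normalized by a planning step; the function names and the two workflows are illustrative assumptions, not a real harness.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]
Workflow = Callable[[Model, str], str]


def monolithic(model: Model, task: str) -> str:
    """Send the raw task to the model in one shot."""
    return model(task)


def decompose_then_solve(model: Model, task: str) -> str:
    """Ask for a plan first, then solve the planned form."""
    plan = model(f"PLAN: {task}")
    return model(f"SOLVE: {plan}")


def toy_model(prompt: str) -> str:
    """Hypothetical stub: it can add, but only after a planning step."""
    if prompt.startswith("PLAN: "):
        return prompt[len("PLAN: "):].replace(" plus ", "+")
    if prompt.startswith("SOLVE: "):
        a, b = prompt[len("SOLVE: "):].split("+")
        return str(int(a) + int(b))
    return "unknown"


def run_grid(
    models: Dict[str, Model],
    workflows: Dict[str, Workflow],
    tasks: List[Tuple[str, str]],
) -> Dict[Tuple[str, str], float]:
    """Exact-match accuracy for every (model, workflow) pair."""
    scores = {}
    for (m_name, model), (w_name, workflow) in product(models.items(), workflows.items()):
        correct = sum(workflow(model, prompt) == expected for prompt, expected in tasks)
        scores[(m_name, w_name)] = correct / len(tasks)
    return scores


tasks = [("2 plus 3", "5"), ("10 plus 4", "14")]
scores = run_grid(
    {"toy": toy_model},
    {"monolithic": monolithic, "decompose_then_solve": decompose_then_solve},
    tasks,
)
print(scores)
```

The stub exaggerates the effect on purpose: the same model scores differently under the two workflows, which is the situation a fixed-architecture benchmark cannot see.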
The corpus also argues that capability isn't a scalar at all. Agent ability decomposes into at least five separable axes (task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness), and models that top one axis often rank lower on another, which makes any single number systematically misleading for deployment (Does a single benchmark score actually predict agent readiness?). That's a direct prescription: report a vector, not a leaderboard rank. It also points to which axes belong on a workflow benchmark, determinism and tool integration among them: in the production case reported, explicit direct function calls restored the determinism that protocol-mediated tool access had lost, and a practitioner survey found that most production teams build custom agents rather than adopt frameworks (Why do protocol-based tool integrations fail in production workflows?).
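As an illustration of "report a vector", the sketch below stores the five axes named above as separate fields and shows how averaging can hide a trade-off. The class name, the `dominates` helper, and all numeric values are made up for the example; they are not measurements from any source.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class CapabilityVector:
    task_success: float
    privacy_compliance: float
    long_horizon_retention: float
    mode_shift_behavior: float
    ecosystem_readiness: float

    def mean(self) -> float:
        """The single number a leaderboard would report."""
        values = astuple(self)
        return sum(values) / len(values)


def dominates(a: CapabilityVector, b: CapabilityVector) -> bool:
    """True if `a` is at least as good on every axis and strictly better on one."""
    pairs = list(zip(astuple(a), astuple(b)))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


# Illustrative, invented scores: same average, opposite strengths.
model_a = CapabilityVector(0.9, 0.5, 0.7, 0.6, 0.8)
model_b = CapabilityVector(0.6, 0.9, 0.8, 0.7, 0.5)

print(model_a.mean(), model_b.mean())
print(dominates(model_a, model_b), dominates(model_b, model_a))
```

When neither vector dominates the other, a ranking between the two models depends entirely on how the axes are weighted, which is a deployment decision rather than a property of either model.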
The deeper payoff is learning to separate failures the model causes from failures the architecture causes, because a benchmark that can't tell them apart will keep blaming the wrong thing. Some ceilings are genuinely architectural and no amount of model scaling fixes them: autoregressive generation can't retract a token, so it structurally fails at constraint satisfaction until you bolt on a symbolic solver (Why does autoregressive generation fail at constraint satisfaction?). Other failures look like weak models but are really structural disorganization: reasoning models that wander and abandon good paths recover accuracy from decoding-level nudges, no fine-tuning needed (Why do reasoning models abandon promising solution paths?), while extended reasoning gives no consistent edge on numerical optimization because the bottleneck is procedure, not thinking steps (Do reasoning models actually beat standard models on optimization?). And restructuring reasoning as recursive subtask trees with cache pruning lets one model do what people assumed needed a multi-agent system (Can recursive subtask trees overcome context window limits?). Read together, these say the most useful benchmark isn't one that asks 'which model wins' but one that varies the architecture and the task horizon to expose where capability actually comes from, which is often the scaffolding, not the weights.
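A rough sketch of how a benchmark that varies more than one factor could show where the score differences come from. It assumes results are keyed by a tuple of factor levels (here model and architecture; a task-horizon level could be added as a third element) and compares the spread of per-level mean scores along each factor. The helper name `factor_spread` and every number in the table are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Tuple


def factor_spread(scores: Dict[Tuple[str, ...], float], axis: int) -> float:
    """Range of the per-level mean score along one factor of the key tuple."""
    groups = defaultdict(list)
    for key, score in scores.items():
        groups[key[axis]].append(score)
    means = [sum(values) / len(values) for values in groups.values()]
    return max(means) - min(means)


# Invented scores keyed by (model, architecture).
scores = {
    ("model_x", "monolithic"): 0.40,
    ("model_x", "decomposed"): 0.70,
    ("model_y", "monolithic"): 0.45,
    ("model_y", "decomposed"): 0.75,
}

model_spread = factor_spread(scores, axis=0)
arch_spread = factor_spread(scores, axis=1)
print(f"model spread: {model_spread:.2f}, architecture spread: {arch_spread:.2f}")
```

A comparison of ranges is a crude summary and ignores interactions between factors; it is meant only to show that the question "model or scaffolding?" becomes answerable once both are varied.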
Sources (11 notes)
DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
Research shows that architectural choices like removing hidden layers, enforcing constraints on self-similarity, and using appropriate likelihood functions deliver better results than deeper or more complex models. This suggests that problem-specific design decisions matter more than raw representational capacity.
Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.
MCP integration caused non-deterministic failures through ambiguous tool selection and parameter inference. Replacing it with explicit direct function calls and single-tool-per-agent design restored determinism. A 306-practitioner survey confirms 85% of production teams build custom agents, forgoing frameworks.
The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.