SYNTHESIS NOTE
Agentic Systems and Tool Use Reasoning, Retrieval, and Evaluation

Are reasoning model collapses really failures of reasoning?

Explores whether language models hit a fundamental reasoning ceiling or whether text-only evaluation masks execution limitations. Examines how tool access might reveal hidden reasoning capabilities.

Synthesis note · 2026-02-23 · sourced from Flaws
Where exactly do reasoning models fail and break?

The "reasoning cliff" — where LRM performance collapses beyond certain complexity thresholds — is reframed as an execution failure, not a reasoning failure. When models are confined to text-only generation, they are forced into the role of "human simulator" (transcribing thousands of discrete steps) rather than "problem solver" (offloading procedural execution to appropriate tools).

The evidence: providing models with explicit algorithms for Tower of Hanoi does not prevent collapse. The model knows the algorithm but cannot execute it autoregressively at scale. This is a tool-use problem, not a reasoning problem. When given code execution access, models solve problems far beyond the supposed cliff.

Tool-enabled evaluation reveals an agentic hierarchy:

First-Order Agency — GPT-4o uses tools for straightforward procedural execution. It implements a strategy and runs it. When the strategy fails, it doesn't recover.

Second-Order Agency — o4-mini uses tools for verification and metacognitive self-correction. It begins with a flawed hypothesis, detects the failure through self-generated simulation, discards the failed strategy, and selects an entirely new correct approach. This plan-test-fail-revise loop mirrors deliberate practice.

The most revealing failure mode: when confined to text-only, models that cannot maintain state and exhaust search spaces declare solvable problems "logically impossible." They mistake their own execution limitations for fundamental impossibilities — a phenomenon analogous to learned helplessness.

The reframe has practical implications. The question shifts from "Can models reason?" to "What kind of reasoners are they, and under what conditions can they ascend the agentic hierarchy?" Evaluations that prohibit tool use are measuring execution bandwidth, not reasoning capability.

Inquiring lines that use this note as a source 185

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map
16 direct connections · 148 in 2-hop network ·dense cluster Open in graph ↗

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

reasoning model performance collapses are execution failures not reasoning failures — tool use reveals an agentic hierarchy