Can reasoning topology reveal what accuracy metrics miss?
Identical accuracy scores can hide different reasoning structures. Can we measure the shape of logical reasoning (how concentrated or diffuse it is), and would such a metric predict model quality better than token count does?
We evaluate reasoning models almost entirely with two scalar numbers: final-answer accuracy and token count. The Reasoning Structure work shows that both numbers are lossy in a specific, measurable way: identical scores can hide fundamentally different reasoning structures. By converting unstructured traces into verifiable graphs of claims and dependencies, the paper turns a reasoning trace into a topological object whose shape can be measured, and it defines a reasoning-efficiency metric η that quantifies how concentrated the model's logical flow is.
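To make the idea of a claim-and-dependency graph concrete, here is a minimal sketch. It does not reproduce the paper's η, whose definition is not given in this note. The `concentration` function is a hypothetical stand-in that reports the share of extracted claims the final answer actually depends on, and the two example traces are invented for illustration.

```python
from collections import deque


def answer_ancestors(deps, answer):
    """Return the set of claims the answer transitively depends on (itself included)."""
    seen = {answer}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for parent in deps.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


def concentration(deps, answer):
    """Hypothetical stand-in for a concentration score (NOT the paper's eta):
    the fraction of all claims in the trace that feed the final answer."""
    claims = set(deps)
    for parents in deps.values():
        claims.update(parents)
    claims.add(answer)
    return len(answer_ancestors(deps, answer)) / len(claims)


# Each key is a claim; its value lists the claims it depends on (invented examples).
focused = {"answer": ["c2"], "c2": ["c1"], "c1": []}
diffuse = {
    "answer": ["c2"], "c2": ["c1"], "c1": [],
    "d1": [], "d2": ["d1"], "d3": ["c1"],  # side claims that never reach the answer
}

print(concentration(focused, "answer"))  # every claim feeds the answer
print(concentration(diffuse, "answer"))  # half of the claims are dead ends
```

Both toy traces reach the same answer through the same three-claim chain, so an accuracy score cannot tell them apart, while the graph-based score can.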
The decisive finding is that η is essentially uncorrelated with token count (r = −0.05, p = 0.64). That kills the most common heuristic in the field: "more thinking tokens means more reasoning." Kimi K2 burns the largest token budgets yet does not beat GPT-5, which is simultaneously the most accurate and the most token-efficient at every difficulty. Token count is therefore not a proxy for reasoning quality; it is at best a proxy for verbosity. This sharpens the related note "Does more thinking time always improve reasoning accuracy?" from a claim about quantity (too many tokens hurt) into a claim about structure (the same token budget can buy concentrated or diffuse reasoning, and only structure predicts quality).
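For readers who want to see what the r statistic measures, the following is a minimal sketch of a Pearson correlation between per-trace token counts and η values. The data are made up for illustration and are not the paper's measurements; the names `pearson_r`, `tokens`, and `eta` are assumptions of this sketch, and the p-value is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Invented values: token counts rise steadily while eta goes up and back down.
tokens = [1000, 2000, 3000, 4000]
eta = [0.5, 0.7, 0.7, 0.5]

r = pearson_r(tokens, eta)
print(f"r = {r:.3f}")  # close to zero: token count says nothing about eta here
```

A value of r near zero, as in the reported result, means that knowing how many tokens a trace used gives no linear information about how concentrated its reasoning was.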
There is a harder finding underneath the efficiency story. On the difficulty sweep from Trivial to Human-hard, every model collapses (GPT-5 from 83.8% to 5.7%, several models to literal 0%) while completion tokens balloon to 20–61k. More compute does not rescue the hardest regime for any model, which suggests the limit is not a budget problem but something structural about scaling reasoning by allocating more computation. This grounds the related note "Do language models fail at reasoning due to complexity or novelty?": if failure is about instance novelty rather than chain length, then throwing tokens at a hard puzzle only extends a chain that cannot reach the answer.
The caveat is scope: logic puzzles are fully specified and unambiguously verifiable, which is exactly what makes graph extraction possible but also what makes them unlike open-ended real tasks. Whether η transfers to domains without clean ground-truth dependency graphs is unproven — the topology metric may be a gift of the puzzle setting rather than a general instrument.
Related concepts in this collection (3)
- Does more thinking time always improve reasoning accuracy?
Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
extends: reframes the token-quantity finding as a structural one — flow concentration, not token count, tracks quality
- Do language models fail at reasoning due to complexity or novelty?
Explores whether reasoning-model failures stem from task complexity thresholds or from encountering unfamiliar instances. Tests whether scaling chain length actually addresses the root cause of reasoning breakdown.
grounds: the puzzle-difficulty collapse despite huge token budgets fits an instance-novelty rather than compute-budget account
- Why do correct reasoning traces contain fewer tokens?
In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
convergent-with: both decouple trace length from reasoning quality, here via topology rather than correctness-conditioned length
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Topology of Reasoning: Understanding Large Reasoning Models through Reasoning Graph Properties
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Reasoning Structure of Large Language Models
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
- Do Large Language Models Latently Perform Multi-Hop Reasoning?
- On the Reasoning Capacity of AI Models and How to Quantify It
Original note title
reasoning topology separates models that accuracy and token count conflate — flow concentration is the diagnostic those metrics hide