SYNTHESIS NOTE

Can reasoning topology reveal what accuracy metrics miss?

Identical accuracy scores can hide different reasoning structures. Can we measure the shape of logical reasoning (how concentrated or diffuse it is), and would that metric predict model quality better than token count?

Synthesis note · 2026-06-27 · sourced from Reasoning Critiques

We evaluate reasoning models almost entirely with two scalar metrics: final-answer accuracy and token count. The Reasoning Structure work shows that both are lossy in a specific, measurable way: identical scores can hide fundamentally different reasoning structures. By converting unstructured traces into verifiable graphs of claims and dependencies, the paper turns a reasoning trace into a topological object whose shape can be measured, and it defines a reasoning-efficiency metric η that quantifies how concentrated the model's logical flow is.
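To make "concentration" concrete, here is a minimal sketch. It is not the paper's definition of η, which this note does not reproduce. It uses a stand-in measure: the fraction of extracted claims that the final answer transitively depends on. The graph encoding, the claim names, and the function `support_fraction` are all assumptions made for illustration.

```python
from collections import deque


def support_fraction(deps, conclusion):
    """Stand-in concentration measure (NOT the paper's eta).

    deps maps each claim to the list of claims it depends on.
    Returns the share of all claims that the conclusion
    transitively relies on, counting the conclusion itself.
    """
    seen = {conclusion}
    queue = deque([conclusion])
    while queue:
        claim = queue.popleft()
        for premise in deps.get(claim, []):
            if premise not in seen:
                seen.add(premise)
                queue.append(premise)
    # Every claim mentioned anywhere in the graph, as a key or as a premise.
    all_claims = set(deps) | {p for ps in deps.values() for p in ps} | {conclusion}
    return len(seen) / len(all_claims)


# Hypothetical traces, invented for illustration.
# Every claim in the first trace feeds the answer.
concentrated = {"c1": [], "c2": ["c1"], "c3": ["c2"], "answer": ["c3"]}
# The second trace explores d1, d2, d3 but never uses them.
diffuse = {"c1": [], "c2": ["c1"], "d1": [], "d2": ["d1"], "d3": [], "answer": ["c2"]}

print(support_fraction(concentrated, "answer"))
print(support_fraction(diffuse, "answer"))
```

Under this stand-in, the first trace scores 1.0 because every claim feeds the answer, and the second scores 0.5 because half of its claims are dead ends. Both traces could still reach the same correct answer, which is the sense in which accuracy alone cannot tell them apart.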

The decisive finding is that η is essentially uncorrelated with token count (r = −0.05, p = 0.64). That undercuts the most common heuristic in the field, "more thinking tokens means more reasoning." Kimi K2 burns the largest token budgets yet does not beat GPT-5, which is simultaneously the most accurate and the most token-efficient at every difficulty. Token count is therefore not a proxy for reasoning quality; it is at best a proxy for verbosity. This sharpens the related note "Does more thinking time always improve reasoning accuracy?" from a claim about quantity (too many tokens hurt) into a claim about structure (the same token budget can buy concentrated or diffuse reasoning, and only structure predicts quality).
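For readers who want to see what the reported r measures, the following is a minimal sketch of the Pearson correlation coefficient between token counts and an efficiency score. The numbers are hypothetical and invented for illustration; they are not the paper's data, and the function `pearson_r` is an assumption of this sketch rather than anything taken from the paper. The sketch does not compute the p-value.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and the two root sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-trace measurements (not from the paper).
token_counts = [1200, 5400, 800, 9100, 3000, 7600]
efficiency = [0.61, 0.58, 0.40, 0.55, 0.72, 0.43]

r = pearson_r(token_counts, efficiency)
print(f"r = {r:.2f}")
```

A value of r near zero, as the paper reports, means that knowing how many tokens a trace used gives almost no linear information about how concentrated its reasoning was.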

There is a harder finding underneath the efficiency story. On the difficulty sweep from Trivial to Human-hard, every model collapses: GPT-5 falls from 83.8% to 5.7% and several models reach literal 0%, while completion tokens balloon to 20–61k. More compute does not rescue the hardest regime for any model, which suggests the limit is not a budget problem but something structural about scaling reasoning by allocating more computation. This grounds the related note "Do language models fail at reasoning due to complexity or novelty?": if failure is about instance novelty rather than chain length, then throwing tokens at a hard puzzle only extends a chain that cannot reach the answer.

The caveat is scope. Logic puzzles are fully specified and unambiguously verifiable, which is exactly what makes graph extraction possible, but also what makes them unlike open-ended real tasks. Whether η transfers to domains without clean ground-truth dependency graphs is unproven; the topology metric may be a gift of the puzzle setting rather than a general instrument.

Original note title

reasoning topology separates models that accuracy and token count conflate — flow concentration is the diagnostic those metrics hide