Reasoning Structure of Large Language Models

Paper · arXiv 2606.03883 · Published June 2, 2026

Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model’s logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.
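To make the idea of a reasoning graph of claims and dependencies concrete, the following is a minimal sketch under stated assumptions. The paper's actual graph schema and its efficiency metric are not defined in this excerpt, so the dictionary layout, the names `ancestors_of` and `used_claim_fraction`, the toy trace, and the quantity computed (the share of claims that the final answer transitively depends on) are all hypothetical stand-ins, not the authors' definitions.

```python
from collections import deque


def ancestors_of(graph, target):
    """Return the target claim plus every claim it transitively depends on.

    `graph` maps each claim id to the list of claim ids it depends on
    (a hypothetical encoding chosen for this sketch).
    """
    seen = {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def used_claim_fraction(graph, answer):
    """Fraction of all claims that feed into the final answer.

    This is an illustrative stand-in for a structural measurement,
    NOT the reasoning efficiency metric defined in the paper.
    """
    all_claims = set(graph)
    for deps in graph.values():
        all_claims.update(deps)
    return len(ancestors_of(graph, answer) & all_claims) / len(all_claims)


# Made-up toy trace: c4 is a dead-end claim the answer never uses.
toy_graph = {
    "c1": [],
    "c2": [],
    "c3": ["c1", "c2"],
    "c4": ["c1"],
    "answer": ["c3"],
}

fraction = used_claim_fraction(toy_graph, "answer")
print(f"claims contributing to the answer: {fraction:.2f}")
```

Two traces with the same token count and the same final answer can differ in a measurement like this one, which is the kind of distinction that accuracy and token count alone cannot express.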

Introduction. Reasoning is central to human intelligence and remains a major challenge for machine learning systems. Thanks to their ability to exploit increased test-time compute through long Chain-of-Thought (CoT) traces (Wei et al., 2022), Large Reasoning Models (LRMs) have shown impressive performance on a broad set of reasoning tasks, including complex coding (Chen et al., 2021), logical deduction (Lin et al., 2025), mathematical reasoning (Cobbe et al., 2021), and spatial reasoning (Berdoz et al., 2026). However, because most evaluations collapse behavior into one-dimensional metrics such as final-answer accuracy or token count, it remains unclear how these models reason. This gap has motivated prior work to develop controllable puzzle environments that are less prone to benchmark saturation and data contamination (Chen et al., 2025a; Zhang et al., 2025). Logic puzzles have long attracted human curiosity because they are “easy to learn, but hard to master.” Unlike many real-world tasks, they are fully specified and admit unambiguous verification.

Discussion / Conclusion. Difficulty scaling exposes a steep accuracy drop despite increased compute. Across all models, accuracy declines sharply as difficulty increases from Trivial to Human hard (GPT-5 drops from 83.8% to 5.7%, Qwen3 235B from 69.5% to 0%, DeepSeek V3.2 from 76.2% to 0%, and Kimi K2 from 77.1% to 0.95%), while mean completion tokens rise substantially (from roughly 4k-11k to approximately 20k-61k). Notably, more tokens do not imply better performance: Kimi K2 consistently uses the largest token budgets yet does not outperform GPT-5, whereas GPT-5 achieves the best accuracy at every difficulty while remaining the most token-efficient. DeepSeek V3.2 is competitive at Human normal (51.4%) but collapses at Human hard despite using more tokens, and Qwen3 235B degrades earlier (only 21.0% at Human normal) before reaching 0% on Human hard. Overall, the hardest regime remains largely unsolved for all models even with large token budgets, suggesting that simply allocating more computation runs into fundamental limits on how far reasoning scales.

Token count is not a proxy for reasoning quality. Across runs, reasoning-flow efficiency η is essentially uncorrelated with token count (r = −0.05, p = 0.64).
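The correlation between efficiency and token count can be illustrated with a short computation. This is a minimal sketch that assumes r denotes Pearson's correlation coefficient, which the excerpt does not state explicitly. The function name `pearson_r` and the sample values are made up for illustration and are not the paper's data; the p-value is omitted because it would require an additional statistical test.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-run values, invented for this example only.
token_counts = [4000, 11000, 20000, 61000]
efficiencies = [0.6, 0.4, 0.7, 0.5]

r = pearson_r(token_counts, efficiencies)
print(f"r = {r:.2f}")
```

A value of r near zero, as reported above, means that knowing how many tokens a run consumed says almost nothing about how efficient its reasoning flow was.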