Featured

ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Jinu Lee, Shivam Agarwal, Amruta Parulekar, et al. · arXiv:2606.05402

As reasoning traces raise questions about whether models are actually thinking or merely performing reasoning, researchers face a deeper puzzle: how to tell discourse that guides genuine problem-solving from discourse that merely decorates it. ReasoningFlow brings a structural lens to this debate by mapping reasoning steps into directed acyclic graphs, exposing the non-linear scaffolding beneath traces that read as linear prose. Yet the paper's own finding that mechanistic causal dependencies diverge from language-level structure points to a troubling gap. If chain-of-thought traces can improve model performance while obscuring what the model actually computes, then parsing their discourse structures more finely might simply give us higher-resolution opacity. The question becomes: does better visibility into *how* reasoning traces are organized bring us closer to understanding *whether* the reasoning is real, or does it risk systematizing an illusion?

Abstract

Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used for better reasoning trace monitorability. (3) In LRMs, most of the erroneous steps are not used to derive final answers. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure. We release the dataset and code in: https://github.com/jinulee-v/reasoningflow.
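Finding (3) in the abstract can be made concrete with a small graph exercise. The sketch below is an illustration only, not the authors' schema or code: it assumes a trace is stored as a list of hypothetical `(premise_step, derived_step)` edges, that erroneous steps are already labeled, and that a step counts as "used" if it has a directed path to the final answer. The function names `ancestors` and `unused_errors` are invented for this example.

```python
from collections import defaultdict


def ancestors(edges, target):
    """Return every step that has a directed path to `target`.

    `edges` is a list of (premise_step, derived_step) pairs forming a DAG.
    This edge format is an assumption made for illustration.
    """
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def unused_errors(edges, erroneous, final_step):
    """Erroneous steps that the final answer does not depend on."""
    used = ancestors(edges, final_step) | {final_step}
    return set(erroneous) - used


# Toy trace (hypothetical): step 2 is a wrong attempt, step 3 checks it and
# triggers backtracking, step 4 restarts from step 1, step 5 is the answer.
edges = [(1, 2), (2, 3), (1, 4), (4, 5)]
erroneous = {2}

print(unused_errors(edges, erroneous, final_step=5))
```

In this toy trace the wrong attempt sits on a branch that never feeds the final answer, which is the pattern the abstract reports as typical for LRMs. Whether a real trace shows the same pattern depends on how edges and error labels are annotated.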

Adjacent research

Synthesis notes nearest this paper, framed as questions.

Do chain-of-thought traces actually help users understand model reasoning?

Do reasoning traces actually cause correct answers?

Can structuring reasoning as knowledge graphs help smaller models solve complex tasks?

Lines of inquiry this paper opens

Not questions with answers, but ways of approaching this research. Each opens a synthesized line of inquiry across the collection.

Reasoning Trace Reliability

Reasoning Model Failure Modes

Chain-of-Thought Faithfulness

Reasoning Model Quality & Training

Reasoning Model Self-Correction Failures

What distinguishes inductive inference from negative evidence versus positive patterns?

Scaling, Sparsity & Data Trade-offs

Does iterative denoising order affect the reasoning style diffusion models learn?