ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces
Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that represents the discourse structure of LRM reasoning traces as fine-grained directed acyclic graphs (DAGs). We develop and validate our annotation schema through careful manual annotation of 31 traces (2.1k steps), achieving high inter-annotator agreement, then scale to automatic annotation of 1,260 traces (247.7k steps) spanning three tasks (math, science, argumentation) and five models (Qwen2.5-32B-Inst, QwQ-32B, DeepSeek-V3, DeepSeek-R1, GPT-oss-120B). By analyzing ReasoningFlow graphs, we find: (1) LRMs exhibit structurally similar traces, despite being trained from different base models and potentially non-overlapping post-training data. (2) ReasoningFlow reveals diverse fine-grained reasoning behaviors (e.g., local verification, self-reflection, and assumptions) that can be used to improve the monitorability of reasoning traces. (3) In LRMs, most erroneous steps are not used to derive the final answer. (4) Mechanistic causal dependencies between steps do not reflect the language-level discourse structure.
Introduction. Large Reasoning Models (LRMs; e.g., DeepSeek-R1 (Guo et al., 2025)) generate extended reasoning traces with non-linear reasoning behaviors, such as verification, self-reflection, and backtracking (Gandhi et al., 2025). This non-linearity complicates both correctness evaluation and faithfulness monitoring. For instance, stepwise evaluation (Lightman et al., 2024) may flag an erroneous step, yet the trace as a whole may still be correct if a later self-verification overrides the earlier error. Recent attempts to understand the non-linear structure of LRM traces either lack expressive relation labels or only annotate inter-paragraph structures (Bogdan et al., 2025; Jiang et al., 2025; Marjanovic et al., 2026), which are too coarse for annotating fine-grained reasoning behaviors. On the other hand, discourse structure annotations for human text (Carlson et al., 2001; Stab and Gurevych, 2017) fail to capture the relations and structures emerging in goal-oriented reasoning traces. We develop ReasoningFlow, a framework for annotating fine-grained discourse structures of reasoning traces.
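To make the graph view concrete, the following is a minimal illustrative sketch, not the ReasoningFlow schema or implementation. It assumes a toy trace whose steps are integer ids and whose edges carry hypothetical labels ("support" for a step used as a premise, "revise" for a step that re-checks and overrides an earlier one); the function name `contributing_steps` is likewise invented for illustration. It shows how a flagged erroneous step can lie outside the set of steps that actually lead to the final answer.

```python
from collections import defaultdict


def contributing_steps(edges, final, follow=("support",)):
    """Return every step that reaches `final` via edges whose label is in `follow`."""
    # Build a reverse adjacency map: step -> list of its premises.
    parents = defaultdict(list)
    for src, dst, label in edges:
        if label in follow:
            parents[dst].append(src)

    # Depth-first walk backwards from the final answer.
    seen, stack = set(), [final]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical toy trace; ids, labels, and the error flag are assumptions.
edges = [
    (0, 1, "support"),  # problem statement -> first computation
    (1, 2, "support"),  # first computation -> erroneous intermediate result
    (2, 3, "revise"),   # step 3 re-checks and overrides step 2
    (1, 3, "support"),  # step 3 is re-derived from step 1
    (3, 4, "support"),  # corrected result -> final answer
]
erroneous = {2}

used = contributing_steps(edges, final=4)
unused_errors = erroneous - used  # erroneous steps the answer does not rely on
```

In this toy graph the erroneous step 2 is connected to the rest of the trace only through a "revise" edge, so a stepwise checker would flag it even though the final answer is derived without it. Passing `follow=("support", "revise")` would instead count it as an ancestor, which is why the choice of edge labels to follow matters for this kind of analysis.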
Discussion / Conclusion. We propose ReasoningFlow, a comprehensive framework for annotating discourse structures in reasoning traces. The ReasoningFlow dataset comprises 1.3k manually and automatically annotated traces with fine-grained node and edge labels. ReasoningFlow can be used to show structural similarity between LRMs, discover novel reasoning behaviors down to the sub-sentence level, assess how erroneous steps causally affect the final answer, and identify the gap between mechanistic and discourse structures in reasoning models. Beyond the analyses presented here, we believe ReasoningFlow can serve as a general, human-interpretable lens for studying the reasoning capabilities of LRMs.