A Primer in Post-Training Reasoning Data: What We Know About How It Works

Paper · arXiv 2606.02113 · Published June 1, 2026

Post-training has become a primary driver of recent progress in large reasoning models, and reasoning data are often the key variable determining whether this stage succeeds. Work on post-training reasoning data has grown rapidly, yet the literature remains scattered across dataset papers, reinforcement-learning recipes, reward-model studies, benchmarks, and frontier system reports. This paper is the first primer to synthesize over 150 key public studies and system reports on post-training reasoning data. We organize the field around four questions: what data objects exist, what makes them useful, how they are constructed, and how they scale. Together, these questions provide an attribution framework for future reasoning-data releases and post-training recipes.

Introduction. The training of large language models typically consists of two stages: pre-training and post-training. With the rise of o1-style test-time scaling paradigms and thinking models, post-training has become increasingly important for further advancing model capabilities (OpenAI, 2024; DeepSeek-AI et al., 2025a). Recent studies suggest that, within the post-training pipeline, the quality and construction of training data often have a greater impact on model performance than other components, such as training algorithms or optimization strategies (Guha et al., 2025; He et al., 2025b; Xu et al., 2025b). However, despite the central role of data in post-training, there has not yet been a systematic survey dedicated to post-training data. This paper fills this gap. To the best of our knowledge, it is the first survey focused on post-training data, reviewing 150 key papers in this area and synthesizing practical insights into how post-training data is constructed, curated, and utilized.

Discussion / Conclusion. Post-training reasoning data have become the hidden substrate of reasoning-model progress, and the stream of releases is accelerating. The evidence reviewed here suggests that the reusable unit is not a prompt–response pair, but a verifier-bearing feedback interface whose value depends on the verifier, base model, lineage, optimizer, scaffold, and inference budget. The central unresolved question is therefore attribution: when a model improves, which part of that interface changed? This primer helps the community move from reporting gains to making them inspectable, comparable, and testable.
