What is the actual reusable unit of reasoning data?
Does post-training reasoning transfer as prompt-response pairs, or as something more complex? Understanding what artifact actually drives gains matters for reproducibility and attribution.
The most useful move in this survey of 150+ post-training studies is a reframing of what reasoning data actually is. The field talks as if the asset being released is a dataset of prompt-response pairs. The primer argues the real reusable unit is a "verifier-bearing feedback interface" whose value is inseparable from six entangled factors: the verifier, the base model, the data lineage, the optimizer, the scaffold, and the inference budget. Change any one and the same "data" produces different gains. The central unresolved question therefore becomes attribution: when a model improves, which part of that interface changed?
This is the connective tissue under several findings the vault already holds separately. The note "When does RL actually extend reasoning beyond pretraining?" is exactly the base-model-and-lineage dependency the primer names: gains attributed to "data" are really data-times-headroom. The note "Does RL teach reasoning or just when to use it?" is the optimizer-and-scaffold dependency: the interface re-weights existing capability rather than installing new data content. And the note "How do quality, diversity, and complexity affect synthetic data differently?" is the construction half of the same problem: a dataset's effect cannot be read off its quality alone because the verifier and budget co-determine it.
The strongest counterargument is that "it's all entangled" can become an excuse for never isolating anything — a survey-level shrug. The primer's defense is that attribution is tractable if releases ship the interface, not just the pairs: report the verifier, the base, the optimizer, the budget, so gains become inspectable, comparable, and testable. For writing, the sharp claim is that the post-training literature's reproducibility crisis is a units problem — people are sharing the wrong object, and benchmark numbers without the interface are uninterpretable.
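To make "ship the interface, not just the pairs" concrete, here is a minimal sketch of what such a release record could look like and how it would support attribution. Everything in it is an assumption for illustration: the `InterfaceManifest` class, its field names (taken from the six factors listed above), the helper functions, and the placeholder values are hypothetical and are not a format the primer defines.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class InterfaceManifest:
    """Hypothetical release record: one field per factor named in the note."""
    verifier: str
    base_model: str
    data_lineage: str
    optimizer: str
    scaffold: str
    inference_budget: str


def changed_factors(a: InterfaceManifest, b: InterfaceManifest) -> List[str]:
    """Return the names of the factors that differ between two releases."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def is_single_factor_comparison(a: InterfaceManifest, b: InterfaceManifest) -> bool:
    """A gain can be pinned on one factor only if exactly one factor differs."""
    return len(changed_factors(a, b)) == 1


# Placeholder values, invented for the example.
release_a = InterfaceManifest(
    verifier="exact-match answer checker",
    base_model="base-model-A",
    data_lineage="distilled from teacher-X",
    optimizer="policy-gradient RL",
    scaffold="single-turn prompt",
    inference_budget="8 samples, 4096 tokens",
)
release_b = replace(release_a, verifier="unit-test checker")
release_c = replace(release_a, base_model="base-model-B",
                    inference_budget="64 samples, 4096 tokens")

print(changed_factors(release_a, release_b))  # ['verifier']
print(changed_factors(release_a, release_c))  # ['base_model', 'inference_budget']
print(is_single_factor_comparison(release_a, release_c))  # False
```

Under this sketch, a benchmark delta between `release_a` and `release_c` cannot be assigned to either the base model or the budget, which is the attribution problem in miniature. A delta between `release_a` and `release_b` is at least a candidate for a verifier effect.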
Related concepts in this collection
- When does RL actually extend reasoning beyond pretraining?
  Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
  grounds (the base-model and lineage dependency the primer names)
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  grounds (the optimizer/scaffold dependency: re-weighting not new content)
- How do quality, diversity, and complexity affect synthetic data differently?
  When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
  extends (construction-side instance of the attribution problem)
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- A Primer in Post-Training Reasoning Data: What We Know About How It Works
- An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
- On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
- OpenThoughts: Data Recipes for Reasoning Models
- Eliciting Reasoning in Language Models with Cognitive Tools
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Original note title
the reusable unit of post-training reasoning is not a prompt-response pair but a verifier-bearing feedback interface — which is why reasoning gains resist attribution