SYNTHESIS NOTE

What is the actual reusable unit of reasoning data?

Does post-training reasoning transfer as prompt-response pairs, or as something more complex? Understanding what artifact actually drives gains matters for reproducibility and attribution.

Synthesis note · 2026-06-27 · sourced from Reinforcement Learning

The most useful move in this survey of 150+ post-training studies is a reframing of what reasoning data actually is. The field talks as if the asset being released is a dataset of prompt-response pairs. The primer argues the real reusable unit is a "verifier-bearing feedback interface" whose value is inseparable from six entangled factors: the verifier, the base model, the data lineage, the optimizer, the scaffold, and the inference budget. Change any one and the same "data" produces different gains. The central unresolved question therefore becomes attribution: when a model improves, which part of that interface changed?

This is the connective tissue under several findings the vault already holds separately. "When does RL actually extend reasoning beyond pretraining?" is exactly the base-model-and-lineage dependency the primer names: gains attributed to "data" are really data-times-headroom. "Does RL teach reasoning or just when to use it?" is the optimizer-and-scaffold dependency: the interface re-weights existing capability rather than installing new data content. And "How do quality, diversity, and complexity affect synthetic data differently?" is the construction half of the same problem: a dataset's effect cannot be read off its quality alone because the verifier and budget co-determine it.

The strongest counterargument is that "it's all entangled" can become an excuse for never isolating anything — a survey-level shrug. The primer's defense is that attribution is tractable if releases ship the interface, not just the pairs: report the verifier, the base, the optimizer, the budget, so gains become inspectable, comparable, and testable. For writing, the sharp claim is that the post-training literature's reproducibility crisis is a units problem — people are sharing the wrong object, and benchmark numbers without the interface are uninterpretable.
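
One way to picture "shipping the interface" is a release record with one field per factor, plus a check on whether two runs can be compared at all. The sketch below is an illustration only, not the primer's method: the class name InterfaceManifest, the helper functions, and every example value are hypothetical. It treats a comparison as attributable only when all six factors are reported and exactly one differs.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class InterfaceManifest:
    # Hypothetical release record (not from the primer): one field per factor
    # the note lists. None means the release did not report that factor.
    verifier: Optional[str] = None
    base_model: Optional[str] = None
    data_lineage: Optional[str] = None
    optimizer: Optional[str] = None
    scaffold: Optional[str] = None
    inference_budget: Optional[str] = None


def missing_factors(m: InterfaceManifest) -> List[str]:
    """Names of the factors a release left unreported."""
    return [f.name for f in fields(m) if getattr(m, f.name) is None]


def changed_factors(a: InterfaceManifest, b: InterfaceManifest) -> List[str]:
    """Names of the factors whose values differ between two runs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


def comparison_status(a: InterfaceManifest, b: InterfaceManifest) -> str:
    """Say whether a gain between two runs can be pinned on one factor."""
    gaps = sorted(set(missing_factors(a)) | set(missing_factors(b)))
    if gaps:
        return "uninterpretable: unreported " + ", ".join(gaps)
    changed = changed_factors(a, b)
    if not changed:
        return "same interface"
    if len(changed) == 1:
        return "attributable to " + changed[0]
    return "confounded: " + ", ".join(changed)


# Placeholder values, invented for illustration only.
run_a = InterfaceManifest("unit-test-checker", "base-A", "lineage-1",
                          "optimizer-X", "scaffold-1", "8k-tokens")
run_b = InterfaceManifest("rubric-grader", "base-A", "lineage-1",
                          "optimizer-X", "scaffold-1", "8k-tokens")
pairs_only = InterfaceManifest(data_lineage="lineage-1")

print(comparison_status(run_a, run_b))       # attributable to verifier
print(comparison_status(run_a, pairs_only))  # uninterpretable: unreported base_model, inference_budget, optimizer, scaffold, verifier
```

The pairs_only case stands in for a release that ships only the prompt-response data: with five factors unreported, the comparison is flagged before any benchmark number is read.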

Related concepts in this collection 3

When does RL actually extend reasoning beyond pretraining? Does reinforcement learning genuinely expand a model's reasoning capabilities, or does it merely improve sampling from existing knowledge? This question hinges on whether pretraining provides sufficient foundation and whether RL targets tasks within reach.
grounds (the base-model and lineage dependency the primer names)
Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
grounds (the optimizer/scaffold dependency: re-weighting not new content)
How do quality, diversity, and complexity affect synthetic data differently? When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
extends (construction-side instance of the attribution problem)

Original note title

the reusable unit of post-training reasoning is not a prompt-response pair but a verifier-bearing feedback interface — which is why reasoning gains resist attribution
