INQUIRING LINE

How can post-training research become reproducible without releasing full interfaces?

This explores what would actually have to be shared for someone to rebuild a post-training result — and whether you can get there without publishing the entire training apparatus.


The corpus has a blunt answer to the first half: the thing you'd need to release isn't the dataset. The reusable unit of post-training reasoning is a *feedback interface* entangled with six moving parts (verifier, base model, lineage, optimizer, scaffold, and budget), and changing any one of them changes what the same data does (see 'What is the actual reusable unit of reasoning data?'). That's why a posted dataset reproduces almost nothing on its own: the signal lives in the coupling, not the rows. So the honest version of the question is which slices of that coupling are load-bearing enough that you can't omit them.

What makes this hard is that the most consequential factors are exactly the ones that stay hidden. When you start from a proprietary pretrained model, RL doesn't add capability so much as amplify one already-dominant format from pretraining and quietly suppress the alternatives, and which format wins depends on model scale, not necessarily on performance (see 'Does RL training collapse format diversity in pretrained models?'). Two labs running 'the same' recipe on different base models can land in different places for reasons neither can see. There's a similar trap on the evaluation side: imitation training can mimic a strong model's confident style well enough to fool human graders without improving factuality or generalization (see 'Can imitating ChatGPT fool evaluators into thinking models improved?'), and LLM judges reward fake references and rich formatting independent of content quality (see 'Can LLM judges be tricked without accessing their internals?'). If your verifier or judge is part of the unreleased interface, a 'reproduced' number can be an artifact of the grader, not the method.

The corpus also shows that small training choices can flip the sign of a result, which raises the reproducibility bar further. Train on nearly impossible RLVR problems and group-relative normalization treats rare lucky successes as high-advantage, teaching shortcut and computation-skipping behaviors that contaminate existing skills (see 'Do overly hard RLVR samples actually harm model capabilities?'). That dependency on the difficulty curve and on optimizer details is invisible in a dataset dump but decisive for outcomes, which is exactly the kind of thing a reproducer needs disclosed.
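To make the normalization effect concrete, here is a minimal sketch. It assumes the common mean-and-standard-deviation form of group-relative normalization (reward minus the group mean, divided by the group standard deviation); the notes do not specify the exact formula, and the group sizes and pass counts are made-up illustrations, not results from any cited experiment.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group: (r - mean) / (std + eps).

    Assumed form of group-relative normalization; eps avoids division by zero
    when every rollout in the group gets the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of 16 rollouts on a nearly impossible problem: one lucky pass.
hard = [1.0] + [0.0] * 15
# Hypothetical group of 16 rollouts on a moderately hard problem: half pass.
medium = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard)
medium_adv = group_relative_advantages(medium)

print(f"lucky success on the hard problem: {hard_adv[0]:.2f}")   # 3.87
print(f"success on the medium problem:     {medium_adv[0]:.2f}") # 1.00
```

Under these assumptions a single success among n binary-reward rollouts gets an advantage of about sqrt(n - 1), so the rarer the success, the harder that one trajectory is reinforced, whether or not its reasoning was sound. Group size and the pass-rate distribution of the training problems are therefore settings a reproducer would need to know.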

Where the corpus gets constructive is in *how to disclose without dumping everything.* There are two complementary moves. First, package the process, not the polished paper: narrative write-ups impose a 'storytelling tax' that erases failed branches and an 'engineering tax' that omits implementation specs, while agent-native research artifacts ship the executable logic, the exploration graph of what failed, and evidence grounding as first-class deliverables (see 'Can research papers preserve the experiments that failed?'). Second, treat the distilled capability itself as a versioned, inspectable file rather than hidden prompt state, separating *what the system knows* from *how it behaves* so each can be audited, corrected, and rolled back independently (see 'Can person-grounded skills remain auditable without hidden prompt state?'). Together these suggest the answer isn't 'release the interface or give up.' It's to release the *parameters of* the interface (the verifier definition and its known biases, the base-model lineage, the optimizer settings, the difficulty distribution, and the failed branches) as auditable artifacts, even when the raw weights or proprietary scaffold stay closed.
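One way to picture such an artifact is a small, versionable record that names each disclosed parameter and makes the gaps visible. The sketch below is an assumption for illustration only: `InterfaceManifest`, its field names, and every value in the example are hypothetical placeholders, not a format proposed by any of the cited notes.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class InterfaceManifest:
    """Hypothetical disclosure record; field names are illustrative, not a standard."""
    verifier: str = ""
    verifier_known_biases: list = field(default_factory=list)
    base_model_lineage: str = ""
    optimizer_settings: dict = field(default_factory=dict)
    difficulty_distribution: dict = field(default_factory=dict)
    failed_branches: list = field(default_factory=list)
    scaffold: str = ""
    budget: str = ""

    def undocumented(self):
        """Return the names of fields left empty, i.e. the undisclosed parts."""
        return [name for name, value in asdict(self).items() if not value]


# Placeholder values only; none of these describe a real training run.
manifest = InterfaceManifest(
    verifier="exact match on final answer",
    verifier_known_biases=["does not check intermediate steps"],
    base_model_lineage="closed base model; family and size class stated in text",
    optimizer_settings={"group_size": 16},
    difficulty_distribution={"near_impossible": 0.1, "medium": 0.6, "easy": 0.3},
    failed_branches=["variant X abandoned: reward hacking observed"],
)

print(manifest.undocumented())  # ['scaffold', 'budget']
print(json.dumps(asdict(manifest), indent=2))
```

Because the record is plain data, it can live in version control next to the write-up, and an empty field is an explicit, reviewable statement that something was withheld rather than a silent omission.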

The thing you didn't know you wanted to know: reproducibility here is less about open weights and more about *attribution*. The corpus reframes 'reproducible' to mean 'I can tell which of the six entangled factors caused your result' — and most of those can be specified in text and code without handing over the model. What kills reproducibility isn't a private interface; it's an *undocumented* one.
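Read this way, attribution becomes a comparison over documented factors. The following sketch assumes each run is described by a plain dictionary keyed by the six factors; the function name `attribute` and the example values are hypothetical, and real factors would rarely reduce to simple equality checks.

```python
FACTORS = ("verifier", "base_model", "lineage", "optimizer", "scaffold", "budget")


def attribute(run_a, run_b):
    """Say whether a gap between two runs can be pinned on a single factor.

    A factor missing from either description counts as undocumented, which
    blocks attribution regardless of what else matches.
    """
    unknown = [f for f in FACTORS if f not in run_a or f not in run_b]
    if unknown:
        return f"not attributable: undocumented {', '.join(unknown)}"
    changed = [f for f in FACTORS if run_a[f] != run_b[f]]
    if not changed:
        return "same documented interface"
    if len(changed) == 1:
        return f"attributable to {changed[0]}"
    return f"confounded: {', '.join(changed)}"


# Placeholder descriptions; the values are labels, not real configurations.
original = {f: "A" for f in FACTORS}
replication = dict(original, base_model="B")
partial = {f: "A" for f in FACTORS if f != "verifier"}

print(attribute(original, replication))  # attributable to base_model
print(attribute(original, partial))      # not attributable: undocumented verifier
```

The second call illustrates the closing point: the comparison fails because one factor is missing from the description, not because anything was kept private.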


Sources (7 notes)

What is the actual reusable unit of reasoning data?

The reusable unit in post-training is a feedback interface entangled with six factors: verifier, base model, lineage, optimizer, scaffold, and budget. Changing any one alters the same data's effect, making attribution tractable only when these are jointly released.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can research papers preserve the experiments that failed?

Publishing imposes a Storytelling Tax (erasing process, failed branches, tacit reasoning) and Engineering Tax (omitting implementation specs). Agent-Native Research Artifacts address both by packaging logic, executable code, exploration graphs of failures, and evidence grounding—treating rejected branches as publishable deliverables rather than editorial casualties.

Can person-grounded skills remain auditable without hidden prompt state?

COLLEAGUE.SKILL treats distilled expertise as versioned files subject to inspection, correction, and rollback—not hidden prompt state. Separating capability tracks from behavior tracks enables independent audit of what someone knows versus how they act.

Next inquiring lines