SYNTHESIS NOTE

Can research papers preserve the experiments that failed?

Traditional papers compress iterative research into linear narratives, discarding failed attempts and implementation details. Could structuring papers as machine-readable packages with exploration graphs make this hidden knowledge visible and reproducible?

Synthesis note · 2026-06-27 · sourced from Agentic Research

Publishing compresses a branching, iterative research process into a linear narrative, and ARA (Agent-Native Research Artifact) names the two costs of that compression precisely. The Storytelling Tax is the systematic erasure of process knowledge — failed experiments, rejected hypotheses, the backtracking that explains why the final approach was chosen — to fit human-readable convention. The Engineering Tax is the gap between reviewer-sufficient prose and agent-sufficient specification: critical implementation details left unwritten because a human referee never needed them. Both taxes were tolerable when every consumer was human. They become critical when AI agents must reproduce and extend the work, because an agent cannot interpolate the tacit knowledge a human author assumed. ARA's response is structural: a four-layer package — scientific logic, executable code with full specs, an exploration graph that preserves the discarded failures, and evidence grounding every claim in raw outputs.
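To make the four-layer idea concrete, here is a minimal sketch of how such a package might be modeled. It is not ARA's actual schema: every class name, field name, file path, and example value below is a hypothetical chosen for illustration. The sketch only assumes what the note states, namely that claims should be grounded in raw outputs and that rejected branches are kept in an exploration graph rather than discarded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Claim:
    """Scientific-logic layer: one claim plus the evidence ids that back it."""
    claim_id: str
    statement: str
    evidence_ids: List[str] = field(default_factory=list)


@dataclass
class Attempt:
    """Exploration-graph node: an adopted or rejected branch of the work."""
    attempt_id: str
    description: str
    status: str  # "adopted" or "rejected" (hypothetical vocabulary)
    reason: str = ""
    parent_id: Optional[str] = None


@dataclass
class ResearchArtifact:
    """Hypothetical four-layer package: logic, code specs, exploration, evidence."""
    claims: List[Claim]
    code_specs: Dict[str, str]   # step name -> exact command or spec
    exploration: List[Attempt]
    evidence: Dict[str, str]     # evidence id -> path to a raw output

    def ungrounded_claims(self) -> List[str]:
        """Return ids of claims with no evidence, or with a dangling evidence id."""
        return [
            c.claim_id
            for c in self.claims
            if not c.evidence_ids
            or any(e not in self.evidence for e in c.evidence_ids)
        ]

    def rejected_branches(self) -> List[Attempt]:
        """Return the failed attempts that a linear narrative would drop."""
        return [a for a in self.exploration if a.status == "rejected"]

    def lineage(self, attempt_id: str) -> List[str]:
        """Walk parent links from an attempt back to its root, root first."""
        by_id = {a.attempt_id: a for a in self.exploration}
        path: List[str] = []
        seen = set()
        current = by_id.get(attempt_id)
        while current is not None and current.attempt_id not in seen:
            seen.add(current.attempt_id)
            path.append(current.attempt_id)
            current = by_id.get(current.parent_id) if current.parent_id else None
        return list(reversed(path))


# Illustrative data only; none of these values come from a real study.
artifact = ResearchArtifact(
    claims=[
        Claim("c1", "Warmup stabilizes training", ["e1"]),
        Claim("c2", "The effect holds at larger scale"),
    ],
    code_specs={"train": "python train.py --seed 0"},
    exploration=[
        Attempt("a0", "baseline run", "adopted"),
        Attempt("a1", "raise learning rate", "rejected", "diverged", parent_id="a0"),
        Attempt("a2", "add warmup", "adopted", parent_id="a0"),
    ],
    evidence={"e1": "outputs/a2/metrics.json"},
)

print(artifact.ungrounded_claims())
print([a.attempt_id for a in artifact.rejected_branches()])
print(artifact.lineage("a1"))
```

In this sketch the second claim is flagged as ungrounded because nothing in the evidence layer backs it, and the rejected branch stays queryable together with the reason it was dropped. That is the kind of check an agent could run mechanically, which prose alone does not allow.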

What makes this more than a better file format is the framing of failure-knowledge as a first-class published object rather than an editing casualty. The exploration graph treats the rejected branches, the part that explains the result, as the deliverable, not the offcut. This is the supply-side complement to the demand-side problem in "Can AI verify research outputs as fast as it generates them?": if agents generate faster than anyone verifies, then publishing evidence-grounded, machine-checkable artifacts is exactly the verification substrate that pace requires. It also reframes "Where does AI assistance become unreliable in research?": much of what makes the "novel experiment / judgment" stage hard for agents may be the missing tacit specification ARA aims to externalize, not an irreducible capability gap.

The skeptical reading: ARA presumes the value of a paper is its reproducible specification, but a paper's narrative also does persuasive and sense-making work. It argues why the result matters, which an exploration graph does not supply. There is also an incentive problem the protocol cannot solve by structure alone: authors are rewarded for clean stories, and publishing one's failed branches exposes exactly the messiness that the linear narrative hides. The format can make failure-knowledge expressible; it cannot by itself make researchers want to expose it.

Related concepts in this collection 3

Can AI verify research outputs as fast as it generates them?
Research suggests AI systems produce plausible findings rapidly but struggle to verify them at the same pace. This creates a bottleneck in verification across all research stages. Understanding this gap matters for assessing when AI assistance is reliable versus risky.
grounds: ARA's evidence layer is a verification substrate for the generation-outpaces-verification problem

Where does AI assistance become unreliable in research?
This explores whether AI capability follows a sharp boundary in research tasks, and what determines which side of that line a task falls on. Understanding this matters because it reveals where humans must stay in control.
extends: missing tacit specification (the Engineering Tax) may explain part of the assistance/autonomy boundary

Can frontier exams really measure cutting-edge AI capability?
Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?
convergent-with: both target agents as first-class research participants rather than human assistants
