Can research papers preserve the experiments that failed?
Traditional papers compress iterative research into linear narratives, discarding failed attempts and implementation details. Could structuring papers as machine-readable packages with exploration graphs make this hidden knowledge visible and reproducible?
Publishing compresses a branching, iterative research process into a linear narrative, and ARA (Agent-Native Research Artifact) names the two costs of that compression precisely. The Storytelling Tax is the systematic erasure of process knowledge — failed experiments, rejected hypotheses, the backtracking that explains why the final approach was chosen — to fit human-readable convention. The Engineering Tax is the gap between reviewer-sufficient prose and agent-sufficient specification: critical implementation details left unwritten because a human referee never needed them. Both taxes were tolerable when every consumer was human. They become critical when AI agents must reproduce and extend the work, because an agent cannot interpolate the tacit knowledge a human author assumed. ARA's response is structural: a four-layer package — scientific logic, executable code with full specs, an exploration graph that preserves the discarded failures, and evidence grounding every claim in raw outputs.
What makes this more than a better file format is the framing of failure-knowledge as a first-class published object rather than an editing casualty. The exploration graph treats the rejected branches — the part that explains the result — as the deliverable, not the offcut. This is the supply-side complement to the demand-side problem in Can AI verify research outputs as fast as it generates them?: if agents generate faster than anyone verifies, then publishing evidence-grounded, machine-checkable artifacts is exactly the verification substrate that pace requires. It also reframes Where does AI assistance become unreliable in research? — much of what makes the "novel experiment / judgment" stage hard for agents may be the missing tacit specification ARA aims to externalize, not an irreducible capability gap.
The skeptical reading: ARA presumes the value of a paper is its reproducible specification, but a paper's narrative also does persuasive and sense-making work — it argues why the result matters, which an exploration graph does not supply. There is also an incentive problem the protocol cannot solve by structure alone: authors are rewarded for clean stories, and publishing one's failed branches exposes the messiness that the Storytelling Tax was designed to hide. The format can make failure-knowledge expressible; it cannot by itself make researchers want to expose it.
Inquiring lines that use this note as a source 2
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
Related concepts in this collection 3
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Can AI verify research outputs as fast as it generates them?
Research suggests AI systems produce plausible findings rapidly but struggle to verify them at the same pace. This creates a bottleneck in verification across all research stages. Understanding this gap matters for assessing when AI assistance is reliable versus risky.
grounds: ARA's evidence layer is a verification substrate for the generation-outpaces-verification problem
-
Where does AI assistance become unreliable in research?
This explores whether AI capability follows a sharp boundary in research tasks, and what determines which side of that line a task falls on. Understanding this matters because it reveals where humans must stay in control.
extends: missing tacit specification (the Engineering Tax) may explain part of the assistance/autonomy boundary
-
Can frontier exams really measure cutting-edge AI capability?
Popular benchmarks like MMLU saturate quickly, hiding real capability differences. Can expert-designed closed-ended exams like Humanity's Last Exam discriminate at the frontier, and what would high scores actually tell us about AI systems?
convergent-with: both target agents as first-class research participants rather than human assistants
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- The Last Human-Written Paper: Agent-Native Research Artifacts
- AI for Auto-Research: Roadmap & User Guide
- AI-Researcher: Autonomous Scientific Innovation
- AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
- From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Deep Researcher with Test-Time Diffusion
- Virtuous Machines: Towards Artificial General Science
Original note title
the narrative paper is a lossy compiler — agent-native research artifacts preserve the failed branches and full specs that storytelling discards