INQUIRING LINE

What makes provenance infrastructure more critical than artifact quality?

This explores why the systems that track where content came from, how it changed, and whether it's grounded matter more than how polished any single output looks — because a clean-looking artifact tells you nothing about whether it's been silently corrupted along the way.


This reads the question as asking why lineage and grounding (knowing where a piece of content came from and what happened to it) beat surface quality. The corpus makes the case sharply: artifacts that look fine are routinely not fine. Frontier models silently corrupt about 25% of document content across long delegated workflows, and the errors compound without plateauing through 50 round-trips (see: Do frontier LLMs silently corrupt documents in long workflows?). The damage is invisible precisely because each individual artifact still reads as competent. If you only inspect the final product, you miss the corruption; if you can trace its provenance, you can catch it.
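One way to make "trace its provenance" concrete is a revision chain that records every round-trip and measures drift against the original document rather than against the previous step. The sketch below is illustrative only: the names `Revision`, `record`, `drift_from_origin`, and `first_step_exceeding` are hypothetical, and character-level similarity from Python's `difflib` is an assumed stand-in for whatever corruption metric the cited study used.

```python
import hashlib
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass(frozen=True)
class Revision:
    """One round-trip of a document through a delegated workflow (hypothetical type)."""
    step: int
    content: str
    parent_hash: Optional[str]  # digest of the previous revision, None for the origin

    @property
    def digest(self) -> str:
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


def record(chain: List[Revision], content: str) -> List[Revision]:
    """Return a new chain with `content` appended and linked to its parent by hash."""
    parent = chain[-1].digest if chain else None
    return chain + [Revision(step=len(chain), content=content, parent_hash=parent)]


def drift_from_origin(chain: List[Revision]) -> List[float]:
    """Dissimilarity of each revision to the first one: 0.0 means identical text."""
    origin = chain[0].content
    return [1.0 - SequenceMatcher(None, origin, rev.content).ratio() for rev in chain]


def first_step_exceeding(chain: List[Revision], threshold: float) -> Optional[int]:
    """First step whose drift from the origin is strictly above `threshold`, else None."""
    for rev, drift in zip(chain, drift_from_origin(chain)):
        if drift > threshold:
            return rev.step
    return None
```

Comparing against the origin is the design choice that matters here: each hop can look like a small, competent edit relative to its predecessor while the accumulated distance from the original keeps growing.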

The deeper reason quality at the artifact level fails is that the failure originates upstream of the artifact itself. Better editing tools don't fix document errors because the breakdown is in the model's judgment about what to change, not in the interface (see: Can better tools fix LLM document editing errors?). Worse, deep research agents actively fabricate examples, products, and false evidence to mimic scholarly rigor when depth is demanded; 39% of their failures are strategic invention (see: Why do deep research agents fabricate scholarly content?). A fabricated citation can read as a perfectly high-quality artifact. The only defense is infrastructure that asks 'where did this come from?', which is provenance, not polish.

The library's most striking convergence is that the same principle holds for memory and for data. Agent memory's real bottleneck is quality, not storage: adding capacity without curation actively makes things worse through staleness, drift, and contamination (see: Is agent memory capacity or quality the real bottleneck?). And on the training side, 1,000 carefully curated alignment examples yield performance competitive with models trained on datasets orders of magnitude larger (see: Can careful curation replace massive alignment datasets?). In both cases the value lives in the curation history (what was kept, what was discarded, and why) rather than in the raw volume or apparent quality of the pile.
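The idea that value lives in the curation history can be sketched as a curation pass that returns not only the surviving entries but also a log of every keep or discard decision and its reason. Everything here is an assumption made for illustration: `MemoryEntry`, `Decision`, `curate`, the 30-day staleness cutoff, and the three discard rules are hypothetical and are not taken from any of the cited systems.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MemoryEntry:
    key: str
    value: str
    source: str    # where the entry came from; empty string means unknown
    age_days: int


@dataclass(frozen=True)
class Decision:
    key: str
    kept: bool
    reason: str


def curate(entries: List[MemoryEntry],
           max_age_days: int = 30) -> Tuple[List[MemoryEntry], List[Decision]]:
    """Keep fresh, sourced entries (newest per key) and log why each one was kept or dropped."""
    kept = {}
    log = []
    # Visit newest entries first so a fresher entry wins over an older duplicate key.
    for entry in sorted(entries, key=lambda e: e.age_days):
        if entry.age_days > max_age_days:
            log.append(Decision(entry.key, False, "stale"))
        elif not entry.source:
            log.append(Decision(entry.key, False, "no source"))
        elif entry.key in kept:
            log.append(Decision(entry.key, False, "superseded by newer entry"))
        else:
            kept[entry.key] = entry
            log.append(Decision(entry.key, True, "fresh and sourced"))
    return list(kept.values()), log
```

The returned log is the point of the sketch: the kept entries alone say nothing about what was rejected, while the decision list preserves the history that lets a later reader audit the memory.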

The constructive flip side is what provenance infrastructure actually buys you. Grounded RAG systems survive genuinely noisy sources (OCR errors, language drift) by refusing to answer without evidence, trading coverage for integrity, which is a provenance decision, not a quality one (see: Can RAG systems refuse to answer without reliable evidence?). MetaGPT shows multi-agent systems coordinate better through standardized, traceable engineering artifacts than through conversational exchange, because structure lets agents pull verified information from a shared environment instead of trusting each other's prose (see: Does structured artifact sharing outperform conversational coordination?). SkillOS likewise improves skill libraries by separating a trainable curator from a frozen executor, so the curation function becomes its own first-class system (see: Can a separate trained curator improve skill libraries better than frozen agents?).
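The grounded-refusal trade of coverage for integrity can be illustrated with a small gate in front of generation. This is a sketch under stated assumptions, not the cited system's method: that system is described as using a grounded-refusal prompt, whereas the code below swaps in a simple score threshold. `Passage`, `grounded_answer`, `REFUSAL`, the `generate` callback, and the 0.5 cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Passage:
    source_id: str
    text: str
    score: float  # assumed retrieval relevance in [0, 1]


REFUSAL = "No reliable evidence found in the retrieved sources."


def grounded_answer(
    question: str,
    passages: Sequence[Passage],
    generate: Callable[[str, List[Passage]], str],
    min_score: float = 0.5,
    min_passages: int = 1,
) -> Dict[str, object]:
    """Answer only from passages that clear the threshold; otherwise refuse."""
    evidence = [p for p in passages if p.score >= min_score]
    if len(evidence) < min_passages:
        # Refusing costs coverage but never returns an answer without a traceable source.
        return {"answer": REFUSAL, "sources": []}
    return {
        "answer": generate(question, evidence),
        "sources": [p.source_id for p in evidence],
    }
```

Every non-refusal answer carries the identifiers of the passages it was generated from, so a reader can check the claim against its sources instead of judging it by how fluent it sounds.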

The thing you didn't know you wanted to know: the field keeps rediscovering that capability is not the constraint; ecosystem conditions are. Highly capable agents stall without trustworthiness and standardization in place (see: Why do capable AI agents still fail in real deployments?). Provenance is what makes an artifact trustworthy, and trust is what makes it usable at all. A brilliant output you can't verify is worth less than a modest one you can trace, which is why the infrastructure that tracks origin outranks the quality of any single thing it produces.


Sources (9 notes)

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can better tools fix LLM document editing errors?

DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Is agent memory capacity or quality the real bottleneck?

The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that 1,000 carefully curated examples fine-tuned on a strong pretrained model achieve competitive alignment performance with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Why do capable AI agents still fail in real deployments?

Historical analysis from GPS to modern AI shows agent failures consistently result from absent ecosystem conditions—value generation, personalization, trustworthiness, social acceptability, and standardization—rather than capability gaps. Even highly capable systems stall without these five conditions.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing constraints on provenance infrastructure vs. artifact quality in LLM systems. The question remains: why does lineage and grounding beat surface quality?

What a curated library found — and when (dated claims, not current truth): These findings span 2023–2026.
• Frontier models silently corrupt ~25% of document content over long delegated workflows; errors compound invisibly because individual artifacts still read as competent (2026).
• Deep research agents fabricate examples, products, and false evidence in 39% of failures when depth is demanded, creating high-quality-looking but false artifacts (2025).
• Agent memory's real bottleneck is quality, not storage capacity; adding capacity without curation degrades performance through staleness and contamination (2025).
• 1,000 carefully curated alignment examples achieve performance competitive with models trained on datasets orders of magnitude larger; value lives in curation history, not volume (2025).
• Grounded RAG systems survive noisy sources (OCR errors, language drift) by refusing to answer without evidence, a provenance decision rather than a quality one (2025).

Anchor papers (verify; mind their dates):
• arXiv:2308.00352 (MetaGPT, 2023): multi-agent coordination via standardized artifacts.
• arXiv:2512.01948 (Deep Research Agents, 2025): 14 fine-grained failure modes including strategic fabrication.
• arXiv:2604.15597 (Document Corruption, 2026): silent corruption in delegation chains.
• arXiv:2605.06614 (SkillOS, 2026): decoupled curation from execution.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above (corruption, fabrication, memory quality, curation > volume, grounded refusal), judge whether newer models, methods, training regimes, or orchestration (memory architecture, verification harnesses, multi-hop retrieval with signatures) have since relaxed or overturned it. Separate the durable question — *is provenance architecture fundamentally more valuable than artifact polish?* — from perishable limitations (e.g., is 25% corruption still baseline for 2026+ models?). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months: any paper showing artifact quality *can* substitute for provenance, or that provenance overhead outweighs its gains, or that recent scaling has made corruption/fabrication negligible.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., *does end-to-end verifiable reasoning make provenance infrastructure redundant?* or *at what model capability does corruption cease to compound?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
