Do frontier LLMs silently corrupt documents in long workflows?
Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.
Delegation requires trust: the expectation that an LLM will execute a task without introducing errors. DELEGATE-52 stress-tests that expectation with 310 work environments across 52 domains (including coding, crystallography, music notation, and genealogy) and a round-trip relay protocol in which each task is paired with its inverse, so a perfect model would recover the original document exactly.
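The round-trip relay idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: `run_relay`, `similarity`, and the two stand-in transforms are hypothetical names, and `difflib` similarity stands in for whatever scoring DELEGATE-52 really uses. In a real run, `forward` and `inverse` would be LLM calls.

```python
import difflib
from typing import Callable, List

Transform = Callable[[str], str]


def similarity(original: str, current: str) -> float:
    """Return a 0..1 similarity score (1.0 means the documents are identical)."""
    # autojunk=False keeps the score stable on longer documents.
    return difflib.SequenceMatcher(None, original, current, autojunk=False).ratio()


def run_relay(document: str, forward: Transform, inverse: Transform,
              round_trips: int) -> List[float]:
    """Apply forward then inverse repeatedly, scoring against the ORIGINAL each time."""
    scores = []
    current = document
    for _ in range(round_trips):
        # One round trip: a task followed by its inverse.
        current = inverse(forward(current))
        scores.append(similarity(document, current))
    return scores


# Hypothetical stand-ins for model calls (assumptions for illustration only).
def reverse_text(text: str) -> str:
    return text[::-1]


def lossy_unreverse(text: str) -> str:
    # Undoes the reversal but silently drops the final character.
    return text[::-1][:-1]


document = "title: sonata in c\ntempo: 120\n"
perfect_scores = run_relay(document, reverse_text, reverse_text, round_trips=5)
lossy_scores = run_relay(document, reverse_text, lossy_unreverse, round_trips=5)
print(perfect_scores)
print(lossy_scores)
```

The perfect pair stays at 1.0 on every round trip, while the lossy pair loses a little each time even though any single output still looks almost right. Scoring against the original rather than the previous round is what makes accumulated drift visible.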
Across 19 LLMs, even frontier systems (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, and weaker models fail more severely. The degradation curve decelerates but does not plateau: the first half of an extended relay accounts for 2-3x more loss than the second half, yet the strongest model still drops below 60% accuracy by round-trip 50. Distractor files, longer documents, and longer interactions all make the degradation worse.
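A back-of-the-envelope calculation helps put these numbers in proportion. The sketch below assumes a constant per-round-trip retention rate, which is an assumption of this example and not the paper's analysis; it also treats 60% at round-trip 50 as an exact endpoint, whereas the text only says the strongest model falls below that. The function names are hypothetical.

```python
def constant_retention(final_accuracy: float, round_trips: int) -> float:
    """Per-round-trip retention r such that r ** round_trips == final_accuracy."""
    return final_accuracy ** (1.0 / round_trips)


def half_loss_ratio(retention: float, round_trips: int) -> float:
    """Loss in the first half of the relay divided by loss in the second half."""
    half = round_trips // 2
    accuracy_mid = retention ** half
    accuracy_end = retention ** round_trips
    first_half_loss = 1.0 - accuracy_mid
    second_half_loss = accuracy_mid - accuracy_end
    return first_half_loss / second_half_loss


# Boundary case taken from the text: 60% accuracy at round-trip 50.
retention = constant_retention(0.60, 50)
ratio = half_loss_ratio(retention, 50)

print(f"per-round-trip loss: {1.0 - retention:.2%}")
print(f"first-half / second-half loss: {ratio:.2f}")
```

Under this simplified model, losing roughly 1% per round trip is enough to reach 60% by round-trip 50, and the first half accounts for only about 1.3x the loss of the second half. The reported 2-3x ratio is larger, which is consistent with a per-round loss that shrinks in later rounds without reaching zero.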
The structural problem: errors are sparse but severe and they compound silently. A user reviewing one or two outputs sees competent work. A user delegating an end-to-end workflow gets a document that looks intact but contains accumulated drift in places they did not check. The trust assumption that holds at single-step interaction collapses at the timescale where delegation is actually valuable.
This is not a "weak model" finding. It is a ceiling on delegated work at the current frontier — one that scales unfavorably with exactly the workflow length that makes delegation attractive.
Inquiring lines that use this note as a source (109)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does token-based production differ from digital file production?
- How does token generation as flow differ from print's archival storage?
- How do current safety benchmarks miss pragmatic alignment failures?
- Why does bidirectional RAG amplify the risk of corpus poisoning attacks?
- Do Doc2Query approaches suffer from the same misaligned-target problem?
- Why do intermediate LLM layers become more precise in frontier models?
- Can evaluators investigate dependencies without accumulating mistakes over time?
- Why do language models produce plausible outputs over accurate failure reports?
- How do autonomous pipelines identify and fix silent bugs in data pipelines?
- Can domain-expert workflows always decompose into inspectable stages for AI?
- Why does each rewrite cycle degrade domain-specific details differently than compression?
- What specific information must be exported from the language system?
- What specific network sizes trigger coordination degradation in LLM systems?
- How does error avalanching differ from entropy collapse as a failure mode?
- Does parallel sampling avoid failed-branch contamination more than sequential thinking?
- What architectural changes would accelerate the cleanup phase?
- Can semantic query expansion overcome vocabulary mismatch in corrupted text?
- How do insert-expansions and third position repair together cover full repair lifecycle?
- Can structured output formats reduce instruction following degradation?
- What extraction errors most reliably propagate through knowledge graph traversal?
- What workflow structure pairs LLM generation with human evaluation most effectively?
- What distinguishes entity errors from relation errors in LLM output?
- Can measuring semantic entropy help us detect unreliable generations?
- Which use cases can tolerate unverified LLM outputs without external verification?
- Do standard language benchmarks underestimate what LLMs can actually do?
- Can curriculum degradation of document quality accelerate policy learning?
- How does face-saving avoidance drive LLM grounding failures?
- What interaction design changes would help LLMs handle underspecified requests?
- Can auditing LLM performance on complex inputs improve NLP pipeline reliability?
- Can small edits to source text compromise entire knowledge graph reliability?
- What training data contamination rates threaten model safety most practically?
- What makes draft-centric systems better anchors for coherence than feed-forward outputs?
- How does trajectory filtering handle noise when language models use code execution tools?
- How does the rate of generation outpace archival of outputs?
- Can critique-only calls in LLMs exploit a measurable gap between generation and evaluation?
- How do insert-expansions help systems probe users before silently diverging?
- Can marking AI provenance solve the grounding problem for generated text?
- Is prompt engineering a workaround rather than a capability fix?
- How does task contamination differ from test set data leakage?
- How does fluent output mask the mythic function of a system?
- What happens when we treat LLM outputs as sampled rather than stored?
- Why do text-only benchmarks underestimate deployed model capability?
- How do insert-expansions differ from third position repair in timing?
- How do traditional quality assurance methods fail for mutable AI outputs?
- How can we detect dishonesty in model outputs separate from capability failures?
- What debugging behaviors signal that a user has abandoned the coding loop?
- What makes extended chains more vulnerable than standard prompts?
- Why do backward-looking benchmarks underestimate LLM scientific value?
- What makes provenance infrastructure more critical than artifact quality?
- What separates good workflow design from poor workflow design?
- How should benchmarks evaluate workflow architecture versus raw model performance?
- How should organizations redesign workflows if LLMs cannot solve optimization directly?
- What makes well-formatted outputs misleading as evidence of model capability?
- Why do frontier models corrupt more documents than weaker models during workflows?
- How do trajectory quality and memory hygiene differ as evaluation metrics?
- Can protocol bridges introduce new failure modes or security vulnerabilities?
- Can memory consolidation fragility be detected and reversed during execution?
- What causes silent document corruption in long LLM workflows?
- Why do LLMs choose incorrect edits despite understanding the task?
- What makes structured memory schemas more stable than freeform text summaries?
- Why does LLM memory consolidation regress below no-memory baselines?
- What failure modes does the negative-space checklist generation method actually catch?
- Why do LLMs strip applicability conditions during memory abstraction?
- Why does workflow position amplify malicious signals downstream?
- Why does increased model capability make detection harder in delegated workflows?
- How does workflow scale change the failure modes of frontier models?
- Can review effort alone keep pace with frontier model degradation?
- What detection mechanisms work best for corruption-style document errors?
- Why do frontier model failures in document editing go undetected by users?
- How does model tier affect whether errors delete or corrupt document content?
- Can tool use or self-conditioning fix degradation in extended LLM workflows?
- Does encoding governance into runtime loops scale as deployment environments become more complex?
- Do synthetic verification chains from long-CoT models match the quality of human-annotated process labels?
- What makes a deployment paradigm credible for maintaining scientific integrity?
- What structural differences between human and LLM production create detectable signatures?
- Why does workflow position amplify malicious signals in multi-agent relay chains?
- Why does credit assignment through memory rewriting avoid expensive LLM parameter updates?
- Do reasoning benchmarks predict real performance in long delegated workflows?
- What makes a learned consolidation rule lossy and where does contamination enter?
- Why does self-verification fail but external process verification work?
- How does error accumulation in workflows scale across multiple model calls?
- Can smaller LLMs perform tool use tasks through modular decomposition?
- What prevents monolithic LLMs from coordinating decomposition with execution?
- What degradation patterns emerge as relay length increases in delegated tasks?
- How do prior errors in context history amplify future failures over time?
- Which model capabilities actually matter for sustained workflow delegation?
- Can memory workspaces resolve contradictory evidence that stateless systems miss?
- Does the alignment frame mislead us about what LLM problems actually are?
- Can we systematically enumerate LLM failure modes from first principles?
- Does bounding textual edits prevent skill degradation better than free rewriting?
- What specific failure modes emerge when agents retrieve stale or contaminated memories?
- What four domain properties make self-healing failure loops actually work?
- Does refining around bad results risk cascading errors in automated research?
- Can automating failure absorption hide problems that governance needs to surface?
- Which workflow positions concentrate the most downstream dependencies and influence?
- Do fluent generated summaries carry false authority over expert judgment?
- Can LLMs reliably audit other language models for errors?
- How do local soundness signals work across different problem domains?
- Can task-agnostic compression of documents remain broadly useful for later queries?
- Why do LLMs degrade on long inputs before hitting context limits?
- Why do naive pruning and quantization destroy LLM performance so easily?
- Why should consolidation be scheduled offline rather than during forward passes?
- Why do static benchmarks miss frontier capabilities that open-world tasks reveal?
- How does grounding LLM reasoning in APIs reduce hallucination in workflow generation?
- Can human inspection of auto-generated workflows catch harmful or incorrect API compositions?
- Why does pre-computed workflow generation work better than runtime tool discovery for data security?
- What makes memorized paragraphs harder to corrupt than generic text?
- Can you compose independent LLM experts without synchronization overhead?
- Why does token ordering in LLMs create sequences rather than true temporal flow?
Related concepts in this collection (5)
- Do frontier models fail differently than weaker models?
  Weaker LLMs delete document content visibly, while frontier models corrupt it invisibly. This shift in failure mode raises questions about whether capability improvements actually improve real-world reliability when reviewers can't easily spot the errors.
  Relation: same paper, mechanism for why frontier failure is harder to detect
- Can better tools fix LLM document editing errors?
  Does giving LLMs agentic tool access (like diffing, re-reading, or structured editors) improve their reliability on long-horizon document workflows? Understanding whether the problem is tool limitations or decision-making quality matters for reliability engineering.
  Relation: same paper, fixes that do not work
- Do short benchmarks predict how models perform over long workflows?
  Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.
  Relation: same paper, methodology implication
- Do models fail worse when their own errors fill the context?
  As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
  Relation: adjacent mechanism for compounding error
- Why do language models fail to act on their own reasoning?
  LLMs produce correct explanations far more often than they produce correct actions. What causes this knowing-doing gap, and can training methods close it?
  Relation: adjacent, capable rationale but unreliable execution
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LLMs Corrupt Your Documents When You Delegate
- The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs
- LLMs Get Lost In Multi-Turn Conversation
- interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification
- How Many Instructions Can LLMs Follow at Once?
- Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words
- Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents
- Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
Original note title
frontier LLMs silently corrupt 25 percent of document content over long delegated workflows without plateauing