SYNTHESIS NOTE


Do frontier LLMs silently corrupt documents in long workflows?

Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.

Synthesis note · 2026-05-18 · sourced from Flaws

Delegation requires trust: the expectation that an LLM will execute a task without introducing errors. DELEGATE-52 stress-tests that expectation with 310 work environments across 52 domains (including coding, crystallography, music notation, and genealogy) and a round-trip relay protocol in which each task is paired with its inverse, so a perfect model would recover the original document exactly.
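The round-trip idea can be sketched in a few lines. This is an illustrative toy, not the DELEGATE-52 harness: `run_relay`, `similarity`, and the three stand-in transforms are hypothetical names, and character-level similarity from `difflib` is an assumed metric chosen only because it is in the standard library. In a real evaluation the forward and inverse steps would be model calls.

```python
import difflib
from typing import Callable, List

Transform = Callable[[str], str]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def run_relay(document: str, forward: Transform, inverse: Transform,
              round_trips: int) -> List[float]:
    """Apply forward then inverse repeatedly and score each result against the original."""
    scores = []
    current = document
    for _ in range(round_trips):
        current = inverse(forward(current))
        scores.append(similarity(document, current))
    return scores


# Hypothetical stand-ins for "a model performs a task" and "a model undoes it".
def to_upper(text: str) -> str:
    return text.upper()


def lossless_inverse(text: str) -> str:
    return text.lower()


def lossy_inverse(text: str) -> str:
    # Assumed failure mode for illustration: one character is lost per round trip.
    return text.lower()[:-1]


if __name__ == "__main__":
    original = "alpha beta gamma delta"
    print(run_relay(original, to_upper, lossless_inverse, 3))
    print(run_relay(original, to_upper, lossy_inverse, 3))
```

With the lossless inverse every round trip scores 1.0. With the lossy inverse the score falls a little on every pass, which is the kind of accumulated drift the relay protocol is designed to expose.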

Across 19 LLMs, even frontier systems (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows. Weaker models fail more severely. The degradation curve decelerates but does not plateau: the first half of an extended relay accounts for 2-3x more loss than the second half, yet the strongest model still drops below 60% accuracy by round-trip 50. Distractor files, longer documents, and longer interactions all make the degradation worse.
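A small arithmetic sketch shows why per-step fidelity that looks excellent still erodes a document over a long relay. It assumes a constant retention rate per round trip, which is a simplification: the note reports a decelerating curve, so this is not a fit to the benchmark data. The function names are hypothetical.

```python
def retained_after(per_round_retention: float, round_trips: int) -> float:
    """Fraction of content still intact after compounding a fixed per-round retention."""
    return per_round_retention ** round_trips


def required_retention(target: float, round_trips: int) -> float:
    """Per-round retention needed to end at `target` after `round_trips` rounds."""
    return target ** (1.0 / round_trips)


if __name__ == "__main__":
    # Assumed figures for illustration only: 99% retention per round trip, 50 round trips.
    print(f"{retained_after(0.99, 50):.3f}")
    print(f"{required_retention(0.60, 50):.4f}")
```

Under this simplification, 99% retention per round trip leaves roughly 0.605 of the content intact after 50 round trips, so staying above 60% at that horizon requires losing only about 1% per round trip.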

The structural problem: errors are sparse but severe, and they compound silently. A user reviewing one or two outputs sees competent work. A user delegating an end-to-end workflow gets a document that looks intact but contains accumulated drift in places they did not check. The trust assumption that holds for single-step interaction collapses at the timescale where delegation is actually valuable.
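The spot-check intuition can be made concrete with a hypergeometric calculation. The sketch below assumes a document split into equally sized sections, with corruption landing in a few of them and a reviewer sampling sections uniformly at random. None of those assumptions come from the benchmark; `miss_probability` is a hypothetical helper.

```python
from math import comb


def miss_probability(total_sections: int, corrupted_sections: int,
                     checked_sections: int) -> float:
    """Probability that a uniform random spot check touches no corrupted section."""
    if not 0 <= corrupted_sections <= total_sections:
        raise ValueError("corrupted_sections must be between 0 and total_sections")
    if not 0 <= checked_sections <= total_sections:
        raise ValueError("checked_sections must be between 0 and total_sections")
    clean = total_sections - corrupted_sections
    # Ways to pick only clean sections, divided by all ways to pick the sample.
    return comb(clean, checked_sections) / comb(total_sections, checked_sections)


if __name__ == "__main__":
    # Assumed example: 100 sections, 5 corrupted, reviewer reads 2.
    print(f"{miss_probability(100, 5, 2):.3f}")
```

In this assumed example, a reviewer who reads 2 of 100 sections misses all 5 corrupted ones about 90% of the time, which is why sparse errors can pass casual review.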

This is not a "weak model" finding. It is a ceiling on delegated work at the current frontier — one that scales unfavorably with exactly the workflow length that makes delegation attractive.

Inquiring lines that read this note (120)

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does tokenized intelligence retain genuine value through exchange-based systems?

How does token-based production differ from digital file production?

How do prompt structure and constraints affect model instruction reliability?

Does alignment training create blind spots in detecting genuine safety threats?

How do current safety benchmarks miss pragmatic alignment failures?

When should retrieval-augmented systems decide to fetch new information?

Why does bidirectional RAG amplify the risk of corpus poisoning attacks?

How should retrieval systems optimize for multi-step reasoning during inference?

What critical LLM failures do standard benchmarks hide?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

Why do language models reinforce false assumptions instead of correcting them?

What causes silent corruption to amplify through delegated workflows?

What role does compression play in language model capability and generalization?

Do language models learn genuine linguistic structure or just surface patterns?

What coordination failures limit multi-agent LLM systems as they scale?

How can AI systems learn from failures without cascading errors?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Does parallel sampling avoid failed-branch contamination more than sequential thinking?

What memory abstraction level best enables agent knowledge reuse?

Why do multi-turn conversations degrade AI intent and coherence?

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

Why can LLMs generate ideas better than they evaluate them?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

Why should disagreement be treated as signal in collaborative reasoning?

Can measuring semantic entropy help us detect unreliable generations?

Can alternative training methods improve on supervised fine-tuning for language models?

Can curriculum degradation of document quality accelerate policy learning?

How do language models establish social grounding in human dialogue?

How does face-saving avoidance drive LLM grounding failures?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

What are the consequences of models training on synthetic data?

What training data contamination rates threaten model safety most practically?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does trajectory filtering handle noise when language models use code execution tools?

Why does verification consistently lag behind AI generation?

How should conversational agents balance goal-driven initiative with user control?

How do insert-expansions help systems probe users before silently diverging?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

Can prompting inject entirely new knowledge into language models?

Is prompt engineering a workaround rather than a capability fix?

How do professional roles and expertise transform with AI-generated content?

How does fluent output mask the mythic function of a system?

Why do benchmark improvements fail to reflect actual reasoning quality?

What mechanisms enable AI systems to generate and spread false beliefs?

How can we detect dishonesty in model outputs separate from capability failures?

How does AI assistance affect human cognitive development and reasoning autonomy?

What debugging behaviors signal that a user has abandoned the coding loop?

Why do correct reasoning traces tend to be shorter than incorrect ones?

What makes extended chains more vulnerable than standard prompts?

How do evaluation biases undermine LLM quality assessment systems?

Why do backward-looking benchmarks underestimate LLM scientific value?

Why do readers trust citations and complexity regardless of accuracy?

What makes provenance infrastructure more critical than artifact quality?

What drives capability and cost efficiency in agent systems?

What separates good workflow design from poor workflow design?

Can single-axis benchmarks accurately predict agent deployment success?

How should benchmarks evaluate workflow architecture versus raw model performance?

How do standardized protocols improve coordination in multi-agent systems?

Why does consolidated memory sometimes degrade agent performance?

What memory architectures best support persistent reasoning across extended interactions?

Does domain specialization cause models to lose capabilities elsewhere?

How should agents balance memory condensation to optimize context efficiency?

Can self-supervised signals enable process supervision without human annotation?

Do synthetic verification chains from long-CoT models match the quality of human-annotated process labels?

How can humans calibrate appropriate trust in AI systems?

What makes a deployment paradigm credible for maintaining scientific integrity?

Does self-reflection enable models to reliably correct their errors?

When do multi-agent approaches outperform single model extended thinking?

Can smaller LLMs perform tool use tasks through modular decomposition?

How can AI agents autonomously learn and transfer skills across tasks?

Does bounding textual edits prevent skill degradation better than free rewriting?

How should memory consolidation strategies shape agent performance over time?

What specific failure modes emerge when agents retrieve stale or contaminated memories?

Why do self-improving systems struggle without clear external performance metrics?

What four domain properties make self-healing failure loops actually work?

Does AI fluency substitute for verifiable accuracy in human judgment?

Do fluent generated summaries carry false authority over expert judgment?

Can model confidence signals reliably improve reasoning quality and calibration?

How do local soundness signals work across different problem domains?

How does memorization interact with learning and generalization?

What makes memorized paragraphs harder to corrupt than generic text?

Does externalizing cognitive work and state improve agent reliability?

Why do reasoning models fail at systematic problem-solving and search?

How do dependency errors propagate through incorrectly formalized definitions?

Related concepts in this collection (5)

Do frontier models fail differently than weaker mo…
Can better tools fix LLM document editing errors?
Do short benchmarks predict how models perform ove…
Do models fail worse when their own errors fill th…
Why do language models fail to act on their own re…

Related papers in this collection (8)

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

LLMs Corrupt Your Documents When You Delegate · 0.88 match
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs · 0.80 match
Long-Horizon-Terminal-Bench: Testing the Limits of Agents on Long-Horizon Terminal Tasks with Dense Reward-Based Grading · 0.78 match
LLMs Get Lost In Multi-Turn Conversation · 0.78 match
interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification · 0.78 match
How Many Instructions Can LLMs Follow at Once? · 0.77 match
AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks? · 0.77 match
Problems with Cosine as a Measure of Embedding Similarity for High Frequency Words · 0.77 match

Original note title

frontier LLMs silently corrupt 25 percent of document content over long delegated workflows without plateauing