INQUIRING LINE

As little as 0.1% poisoned training data survives safety alignment — and the stealthy attacks slip through while jailbreaks get caught.

What training data contamination rates threaten model safety most practically?

This explores which kinds and quantities of corrupted training data actually create durable safety risks. The corpus reframes the question: the practical danger lies not in how *much* data is poisoned but in how *little* it takes to survive the defenses we trust.

The most striking corpus finding is that the threatening rate is far lower than intuition suggests. Adversarial poisoning at just **0.1% of pretraining data persists through standard safety alignment** for denial-of-service, context-extraction, and belief-manipulation attacks (see: How much poisoned training data survives safety alignment?). The alarming part is less the small fraction than the asymmetry: the one attack class that alignment *does* scrub out is jailbreaking, so the attack we routinely test for is the one the defenses stop, while the quieter attacks slip past. The rate that threatens safety most is the one low enough to look like noise yet high enough to plant a behavior.

But 'contamination' in the corpus is much broader than a malicious actor seeding examples. A second, slower form is self-inflicted: training models on their own or other models' outputs causes **irreversible tail collapse**, where rare events and unusual patterns vanish a little more each generation across VAEs, GMMs, and LLMs alike (see: Does training on AI-generated content permanently degrade model quality?). Here there is no clean threshold at all: even a *mixture* of real and synthetic data compounds the loss, which makes genuine human data a safety resource, not just a quality one. Synthetic pipelines fail in subtler ways too: randomly sampled tool-calling data produces incoherent, unrealistic traces because unrelated tools can't credibly compose (see: Why does random tool sampling produce unrealistic synthetic training data?).
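The collapse mechanism can be illustrated with a toy simulation. The sketch below is not the experiment from the cited work: it repeatedly fits a one-dimensional Gaussian to a small sample drawn from the previous generation's fit, using purely synthetic data at every step (a simplification of the mixed real-plus-synthetic case). The function name `recursive_fit` and its parameters are hypothetical.

```python
import random
import statistics


def recursive_fit(generations=200, n_samples=20, seed=0):
    """Fit a Gaussian to samples from the previous fit, over and over.

    Returns the fitted standard deviation at every generation,
    starting with the "real" distribution at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stand-in for real data
    sigmas = [sigma]
    for _ in range(generations):
        # Each generation only ever sees the previous generation's output.
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # maximum-likelihood estimate
        sigmas.append(sigma)
    return sigmas


sigmas = recursive_fit()
print(f"std at generation 0:   {sigmas[0]:.3f}")
print(f"std at generation 200: {sigmas[-1]:.3g}")
```

Under these assumptions the fitted standard deviation tends to shrink toward zero over generations. That is the toy analogue of tails disappearing: once the spread narrows, rare values are no longer sampled, so later generations cannot relearn them.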

A third form is contamination by difficulty rather than by source. In RLVR (reinforcement learning with verifiable rewards), training on **near-impossible problems doesn't just fail to help; it actively corrupts pre-existing capabilities**, because group-relative normalization treats rare accidental successes as high-advantage trajectories and reinforces shortcuts like answer repetition and skipped computation (see: Do overly hard RLVR samples actually harm model capabilities?). Relatedly, **teacher-refined data that exceeds the student's learning frontier degrades the student even when it's objectively higher quality** (see: Does teacher-refined data always improve student model performance?). So 'better data' can be contaminating if it's mismatched: the danger is in the relationship between data and model, not in the data alone.
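To see why a rare accidental success gets outsized weight, here is a minimal sketch of group-relative normalization. It assumes the GRPO-style recipe of standardizing each rollout's reward by the mean and standard deviation of its group; the source only names "group-relative normalization", so the exact formula, the function name, and the reward values are illustrative assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when every rollout scores the same.
    return [(r - mean) / (std + eps) for r in rewards]


# Moderately hard problem: 4 of 8 rollouts are correct.
balanced = group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])

# Near-impossible problem: 1 of 16 rollouts is marked correct, possibly by luck.
lucky = group_relative_advantages([1] + [0] * 15)

print(f"advantage of a success, 4/8 correct:  {balanced[0]:.2f}")
print(f"advantage of a success, 1/16 correct: {lucky[0]:.2f}")
```

For binary rewards with success fraction p, this standardization gives each success an advantage of about sqrt((1 - p) / p): roughly 1.0 when half the group succeeds and roughly 3.87 when one rollout in sixteen does. The rarer the success, the harder the update pushes toward whatever that rollout did, including a shortcut that happened to land on the right answer.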

The lateral surprise is that contamination doesn't stop at training time; it recurs at inference. A model's **own prior errors filling its context window cause non-linear degradation** on long tasks, and scaling the model doesn't fix it; only test-time 'thinking' that keeps error-poisoned context from biasing reasoning helps (see: Do models fail worse when their own errors fill the context?). At the workflow scale, this shows up as frontier models **silently corrupting ~25% of document content over long delegated relay tasks**, with errors compounding through 50 round-trips without ever plateauing (see: Do frontier LLMs silently corrupt documents in long workflows?). The same dynamic, small corruption that doesn't self-limit, appears at both ends of the pipeline: a 0.1% poisoning rate in pretraining and roughly 25% accumulated corruption in runtime relays.
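A back-of-the-envelope calculation shows how small a per-step error can be and still reach that figure. The sketch below assumes a deliberately simple model in which every round-trip independently preserves the same fraction of the surviving content; the cited notes do not claim this model, and the function names are hypothetical.

```python
def per_trip_retention(total_retained, round_trips):
    """Constant per-trip retention r such that r ** round_trips == total_retained."""
    return total_retained ** (1.0 / round_trips)


def retained_after(per_trip, round_trips):
    """Fraction of content still intact after a number of round-trips."""
    return per_trip ** round_trips


# ~25% corrupted after 50 round-trips means ~75% retained.
r = per_trip_retention(0.75, 50)
print(f"implied loss per round-trip: {1 - r:.3%}")

for n in (1, 10, 25, 50):
    corrupted = 1 - retained_after(r, n)
    print(f"after {n:>2} round-trips: {corrupted:.1%} corrupted")
```

Under this assumption the implied loss is a little under 0.6% per round-trip, a rate that would be easy to dismiss in a single-step evaluation. The self-conditioning effect described in the notes, where earlier errors make later errors more likely, would make the curve steeper than this constant-rate sketch.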

If there's a practical takeaway across these notes, it's that the dangerous contamination is the kind without a visible threshold: poisoning low enough to survive alignment, synthetic feedback loops with no safe mixing ratio, and error accumulation that compounds instead of plateauing. The corpus also hints at where leverage lives: **data-side statistics can flag risk even when the model itself is confident** (see: Can pretraining data statistics detect hallucinations better than model confidence?), and **difficulty-ranked pruning can remove redundant data without accuracy loss** (see: Can we prune training data without hurting model performance?). Together these suggest that auditing data composition, not just measuring a poisoning percentage, is the more useful safety lens.
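As one concrete picture of difficulty-ranked pruning, the sketch below scores examples with EL2N, one of the metrics named in the pruning note: the L2 distance between a model's predicted class probabilities and the one-hot label. It is a simplified illustration. In practice the score is usually averaged over several models early in training, and the example data, field names, and `prune_easiest` helper here are hypothetical.

```python
import math


def el2n_score(probs, label):
    """L2 distance between predicted class probabilities and the one-hot label."""
    return math.sqrt(
        sum((p - (1.0 if i == label else 0.0)) ** 2 for i, p in enumerate(probs))
    )


def prune_easiest(examples, keep_fraction):
    """Keep the hardest `keep_fraction` of examples, ranked by EL2N score."""
    ranked = sorted(
        examples,
        key=lambda ex: el2n_score(ex["probs"], ex["label"]),
        reverse=True,  # highest score (hardest) first
    )
    n_keep = max(1, round(len(examples) * keep_fraction))
    return ranked[:n_keep]


# Made-up predictions for four examples whose true class is 0.
examples = [
    {"id": "a", "probs": [0.98, 0.01, 0.01], "label": 0},  # confidently right
    {"id": "b", "probs": [0.40, 0.35, 0.25], "label": 0},  # uncertain
    {"id": "c", "probs": [0.10, 0.85, 0.05], "label": 0},  # confidently wrong
    {"id": "d", "probs": [0.90, 0.05, 0.05], "label": 0},  # easy
]

kept = prune_easiest(examples, keep_fraction=0.5)
print([ex["id"] for ex in kept])
```

One caution for a safety audit: a ranking like this keeps the confidently wrong example "c", which could be either a valuable hard case or a mislabeled or poisoned one. Difficulty scores surface candidates for inspection; they do not by themselves say which kind an example is.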

Sources (9 notes)

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Does training on AI-generated content permanently degrade model quality?

Models trained on mixtures of real and AI-generated data progressively lose rare events and unusual patterns across VAEs, GMMs, and LLMs. Each generation compounds the loss, making genuine human data increasingly valuable.

Why does random tool sampling produce unrealistic synthetic training data?

Random tool sampling fails because unrelated tools cannot credibly compose, and Q&A framing ignores multi-turn dialogue coherence. ToolFlow shows that sampling tools from relevance graphs and generating with dialogue plans closes this gap.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs (1.64 match, arXiv)
A Little Human Data Goes A Long Way (1.63 match, arXiv)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.61 match, arXiv)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.57 match, arXiv)
Task Contamination: Language Models May Not Be Few-Shot Anymore (1.54 match, arXiv)
LLMs Corrupt Your Documents When You Delegate (0.88 match, arXiv)
Beyond neural scaling laws: beating power law scaling via data pruning (0.88 match, arXiv)
Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning (0.87 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a safety researcher re-testing contamination threats in LLM training and inference. The question remains open: *which contamination rates and types pose the most practical safety risk, and have recent models, alignment methods, or evaluation tools shifted the threat landscape?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat all as provisional.
- **Adversarial pretraining poisoning at just 0.1% persists through standard safety alignment** for denial-of-service, context extraction, and belief manipulation — while jailbreaking (the defense actually tested) gets scrubbed, leaving quieter attacks undetected (2024–25).
- **Training on model-generated or synthetic output causes irreversible tail collapse** with no safe mixing ratio; even hybrid real+synthetic data compounds loss across VAEs, GMMs, LLMs (2023).
- **Over-hard RLVR samples and teacher-refined data exceeding student capacity actively corrupt pre-existing capabilities**, because normalization treats accidental successes as high-value and reinforces shortcuts (2025–26).
- **A model's own prior errors in its context window cause non-linear long-horizon degradation** (~25% silent document corruption over 50 relay rounds, compounding without plateau) (2025–26).
- **Data-side statistics and difficulty-ranked pruning can flag and mitigate risk** better than poisoning percentage alone (2022–25).

Anchor papers (verify; mind their dates):
- arXiv:2410.13722 (Persistent Pre-Training Poisoning, Oct 2024)
- arXiv:2305.17493 (Curse of Recursion, May 2023)
- arXiv:2409.15360 (Reward-Robust RLHF, Sep 2024)
- arXiv:2604.15597 (LLMs Corrupt Documents, Apr 2026)

Your task:
(1) **RE-TEST THE ASYMMETRY.** For each constraint above—esp. the 0.1% poisoning persistence and the 25% silent corruption—check whether newer alignment techniques (e.g., constitutional AI, adversarial training, process-based reward models), inference hardening (chain-of-thought verification, multi-agent checking, retrieval-augmented confidence gating), or better contamination *detection* have since relaxed or overturned it. Separate the durable question (contamination thresholds likely still unresolved) from the perishable limitation (e.g., "standard alignment fails" — does it still, or have methods caught up?). Cite what changed.
(2) **Surface contradicting or superseding work from the last ~6 months.** Look for papers claiming poisoning *is* detectable at scale, synthetic data *is* safe at certain ratios, or long-horizon error *does* plateau under new inference methods. Flag disagreement explicitly.
(3) **Propose 2 research questions that assume the regime shifted.** E.g., "If 0.1% poisoning now persists *even after* constitutional AI, what alignment target is being missed?" or "If difficulty-ranked pruning now blocks tail collapse, does it also block rare-but-valuable patterns needed for emergent reasoning?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
