INQUIRING LINE

Can knowledge poisoning attacks succeed with less than 0.05 percent modified text?

This explores how small a fraction of corrupted text can still hijack a model's knowledge — and whether the corpus pins down a threshold as low as 0.05%.


The worry behind data poisoning is a 'how little does it take' question: an attacker may not need to control much of a corpus to bend what a model believes. The honest answer from this collection: the closest hard number isn't below 0.05%; it's 0.1%. At that rate, poisoning that causes denial-of-service, context extraction, or planted false beliefs survives the full safety-alignment pipeline, persisting through the very post-training step meant to scrub bad behavior (see 'How much poisoned training data survives safety alignment?'). The one attack type that alignment *does* suppress is jailbreaking, which is a useful clue: poisoning that changes what a model knows is stickier than poisoning that changes what it refuses to do. So 0.1% works and survives, and nothing in the corpus suggests 0.05% would be the cliff where it stops.
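
To make the two rates concrete, here is a minimal arithmetic sketch. The corpus size is a hypothetical figure chosen only for illustration; it does not come from the cited work, and `poisoned_tokens` is an illustrative helper name.

```python
def poisoned_tokens(corpus_tokens: int, rate_percent: float) -> int:
    """Tokens an attacker would need to control at a given poisoning rate."""
    return round(corpus_tokens * rate_percent / 100)


# Hypothetical corpus size (assumption, for illustration only).
CORPUS_TOKENS = 1_000_000_000_000  # one trillion tokens

for rate in (0.1, 0.05):
    count = poisoned_tokens(CORPUS_TOKENS, rate)
    print(f"{rate}% of the corpus is {count:,} tokens")
```

Halving the rate from 0.1% to 0.05% halves the attacker's budget but leaves it at the same order of magnitude, which is one reason to doubt that 0.05% marks a sharp boundary.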

More interesting is that the raw percentage may be the wrong lens entirely. Two findings here suggest poisoning can succeed without anyone planting a single false statement. Models perform 'out-of-context reasoning': they stitch together implicit hints scattered across many documents to reconstruct facts that appear in no single place (see 'Can LLMs reconstruct censored knowledge from scattered training hints?'). The flip side of that capability is an attack surface: you don't need to inject a claim, only enough fragmentary breadcrumbs for the model to infer it. That reframes 'percent of modified text' as the wrong unit; what matters is how cheaply a few cooperating fragments can steer an inference.
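
A toy analogy for the stitching idea, loosely modelled on the distance-relationship experiments summarised in the sources: each 'document' states only one distance from a hidden location to a landmark. The coordinates, names, and `consistent` helper are invented for illustration, and this is a geometric sketch, not a model of how an LLM actually performs the inference.

```python
import math

# Hypothetical landmarks and candidate locations on a flat grid (assumptions).
LANDMARKS = {"north": (0, 10), "west": (0, 0), "east": (10, 0)}
CANDIDATES = {"P": (3, 4), "Q": (4, 3), "R": (3, -4), "T": (6, 7)}
SECRET = "P"  # the fact that no single document states

# Each "document" carries exactly one hint: (landmark name, distance to the secret).
documents = [
    (name, math.dist(CANDIDATES[SECRET], xy)) for name, xy in LANDMARKS.items()
]


def consistent(hints):
    """Return the candidates whose distances match every hint given."""
    return {
        cand
        for cand, xy in CANDIDATES.items()
        if all(math.isclose(math.dist(xy, LANDMARKS[lm]), d) for lm, d in hints)
    }


for doc in documents:
    print(doc[0], "alone leaves", sorted(consistent([doc])))
print("all hints together leave", sorted(consistent(documents)))
```

Each hint on its own is compatible with several candidates; only the combination pins down the answer. That is the property that makes scattered fragments hard to spot by inspecting documents one at a time.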

The same 'minimal cost' theme shows up from the defensive side. Deliberately injecting *structured* knowledge improves models at very low corpus cost (see 'Does refusing explicit knowledge harm AI system performance?'), which is the same mechanism poisoning exploits, just pointed the other way. A tiny, well-targeted edit to a corpus is leverage whether your intent is to help or to corrupt.

Then there's the retrieval angle, which sidesteps training-data percentages altogether. In RAG systems, an attacker doesn't poison the model; they poison the document store, and a single malicious document can dominate if it gets retrieved. The defenses developed for this (partition-aware retrieval to bound any one document's influence, token-masking to flag documents whose similarity collapses suspiciously) tell you the threat is real enough that people are building retrieval-time tripwires for it (see 'Can we defend RAG systems from corpus poisoning without retraining?'). Here the meaningful 'fraction' is one document out of a corpus, often far below 0.05%, and it can still win at query time.
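
The token-masking idea can be illustrated with a deliberately simple sketch. This is not the RAGMask implementation: it swaps the learned retriever for bag-of-words cosine similarity, and the example strings, function names, and 0.5 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[tok] for tok, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def max_masking_drop(query: str, doc: str) -> float:
    """Largest relative similarity drop from masking one distinct document token."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    base = cosine(q, d)
    if base == 0.0:
        return 0.0
    worst = 0.0
    for tok in d:
        masked = d.copy()
        del masked[tok]  # mask every occurrence of this token
        worst = max(worst, (base - cosine(q, masked)) / base)
    return worst


THRESHOLD = 0.5  # hypothetical cut-off for "suspicious collapse"
query = "capital of france"
benign = "paris is the capital of france"
stuffed = "france france france france buy pills now"

for name, doc in (("benign", benign), ("stuffed", stuffed)):
    drop = max_masking_drop(query, doc)
    print(name, round(drop, 2), "FLAG" if drop > THRESHOLD else "ok")
```

The benign document matches the query on several tokens, so masking any one of them costs only part of its similarity. The stuffed document's match hangs on a single repeated token, so masking it collapses the similarity entirely, which is the kind of abnormal collapse a retrieval-time check could flag.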

The takeaway: the scary part of poisoning isn't a magic low percentage; it's that the percentage isn't really the variable. Knowledge attacks succeed through *placement and inference*, not volume. A handful of scattered hints, one well-retrieved document, or a 0.1% slice that outlasts alignment all do damage without needing to dominate the training set.


Sources (4 notes)

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can LLMs reconstruct censored knowledge from scattered training hints?

Language models perform out-of-context reasoning across the full training distribution, reconstructing information never explicitly stated in any single document. Experiments show models can infer city identities from scattered distance relationships and apply them downstream without in-context learning.

Does refusing explicit knowledge harm AI system performance?

AI systems that learn exclusively from data produce uninterpretable representations, inherit statistical biases uncorrected by normative rules, and fail to generalize beyond training distributions. Structured knowledge injection at minimal corpus cost substantially improves performance.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a threat researcher evaluating whether data-poisoning constraints have shifted since mid-2024. The core question remains open: what is the minimum corpus fraction needed for a knowledge-poisoning attack to succeed and persist?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024 to May 2026. Key constraints:
  • 0.1% poisoned data survives post-training alignment; denial-of-service and false-belief attacks persist, while jailbreaking is suppressed (~2024-10).
  • Models reconstruct facts through 'out-of-context reasoning' — stitching implicit hints across documents — lowering the effective poisoning threshold below bulk-corpus metrics (~2024-06).
  • In RAG systems, a single malicious document (often <0.05% of corpus) can dominate at retrieval time; lightweight defenses (partition-aware retrieval, token-masking) exist but require explicit deployment (~2025-05).
  • Structured knowledge injection is high-leverage at very low corpus cost, applying the same mechanism poisoning exploits (~2025-02).

Anchor papers (verify; mind their dates):
  • 2024-10, arXiv:2410.13722 — Persistent Pre-Training Poisoning of LLMs
  • 2024-06, arXiv:2406.14546 — Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training
  • 2025-05, arXiv:2505.16014 — Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains
  • 2025-02, arXiv:2502.10708 — Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey

Your task:
  (1) RE-TEST EACH CONSTRAINT. For the 0.1% threshold, 0.05% RAG attack surface, and inference-stitching attack vector, determine whether newer model architectures (e.g., reasoning models, sparse experts), post-training methods (rejection sampling, DPO, constitutional AI), retrieval harnesses (adaptive re-ranking, semantic caching), or evals have since relaxed or overturned these limits. Flag which constraints appear perishable (vulnerable to newer training/tooling) and which remain durable. Cite what resolved each, plainly.
  (2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months — especially any showing poisoning fails below prior thresholds, or conversely, succeeds at even lower fractions via new inference modes (e.g., chain-of-thought, agentic reasoning).
  (3) Propose 2 research questions that assume the threat model may have migrated — e.g., "Do reasoning-model chain-of-thought traces amplify or dampen the signal from scattered hints?" or "Can retrieval-time defenses survive adaptive adversaries who poison the re-ranking model itself?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
