Can we defend RAG systems from corpus poisoning without retraining?

Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.

Synthesis note · 2026-05-03

RAG poisoning attacks insert malicious documents into the retrieval corpus so they get pulled in for matching queries and steer generation toward attacker-preferred outputs. Existing defenses typically require retraining the retriever or the generator, which is expensive and slow to deploy. RAGPart and RAGMask propose two lightweight defenses that operate at retrieval time without modifying the generation model.

RAGPart exploits a structural property of dense retrievers: they learn discriminative patterns from how the training data is partitioned, which means malicious documents inserted into one partition have predictably limited influence on retrieval from queries that match a different partition. By configuring partitions deliberately, the system bounds how far any single poisoned document can propagate. RAGMask takes a different angle: it masks tokens in candidate documents and watches for abnormal similarity shifts. Genuine documents are robust to token masking — their similarity scores degrade smoothly — while poisoned documents that rely on specific trigger tokens show sudden similarity collapse, which serves as a detection signal.

The architectural significance is that defense need not be coupled to training. Both methods sit at the retrieval layer and treat the generator as an untrusted black box that must be protected from upstream corruption. This separation matters operationally because retrieval corpora update faster than retrievers can be retrained, so defenses that require retraining are always behind the threat. The threat surface is real and severe — How vulnerable is GraphRAG to tiny text manipulations? shows even minimal corpus modifications can devastate accuracy in graph-structured RAG.

Inquiring lines that read this note 42

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Does alignment training create blind spots in detecting genuine safety threats?

When should retrieval-augmented systems decide to fetch new information?

How do adversarial and manipulative prompts attack reasoning models?

How should retrieval systems optimize for multi-step reasoning during inference?

Why do readers trust citations and complexity regardless of accuracy?

How do retrieval failures enable generation of fabricated scholarly constructs?

How do standardized protocols improve coordination in multi-agent systems?

What are the consequences of models training on synthetic data?

Can the serving loop itself become the primary training data source?

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

How does graph structure amplify poisoning compared to flat document retrieval?

How do training priors constrain what context information can override?

How does keyword priming enable language models to spread poisoned information?

What factors beyond surface content determine how readers extract meaning differently?

Why does finetuning cause catastrophic forgetting of model capabilities?

How do trained weights differ from a stored library or text?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

What detection mechanisms work best for corruption-style document errors?

What causes silent corruption to amplify through delegated workflows?

How do workflow-inspecting defenses fail when contamination enters at planning time?

How should memory consolidation strategies shape agent performance over time?

What makes timestamped knowledge repositories better than static memory?

How can AI agents autonomously learn and transfer skills across tasks?

Does bounding textual edits prevent skill degradation better than free rewriting?

Can language model RL training avoid reward hacking and misalignment?

What economic incentives make advertisement embedding attacks persistently viable?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

Related concepts in this collection 3

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

15 direct connections · 126 in 2-hop network ·medium cluster Open in graph ↗

Can we defend RAG systems from corpus poisoning … How vulnerable is GraphRAG to tiny text manipulati… Can RAG systems safely learn from their own genera… Can one compromised agent corrupt an entire multi-…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

How vulnerable is GraphRAG to tiny text manipulations? GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
extends: documents the threat severity that motivates these defenses; together they form an attack/defense pair on the same retrieval corpus surface
Can RAG systems safely learn from their own generated answers? Explores whether retrieval-augmented generation can feed its outputs back into the corpus without corrupting knowledge with hallucinations. The core problem: how to prevent feedback loops from compounding errors.
extends: bidirectional RAG opens a write surface that magnifies the poisoning attack vector; partition-aware retrieval and token-masking detection are exactly the kind of upstream defenses such systems will need
Can one compromised agent corrupt an entire multi-agent network? Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.
extends: same lesson — defense at the message/retrieval layer beats trying to harden the generator; both attacks slip through ordinary content channels

Can we defend RAG systems from corpus poisoning without retraining?

Inquiring lines that read this note 42

Related concepts in this collection 3

Related papers in this collection 8

Search by related questions 5