Can we defend RAG systems from corpus poisoning without retraining?
Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. Matters because corpus updates outpace model retraining in production RAG systems.
RAG poisoning attacks insert malicious documents into the retrieval corpus so that they are retrieved for matching queries and steer generation toward attacker-preferred outputs. Existing defenses typically require retraining the retriever or the generator, which is expensive and slow to deploy. RAGPart and RAGMask are two lightweight defenses that operate at retrieval time without modifying the generation model.
RAGPart exploits a structural property of dense retrievers: they learn discriminative patterns from how the training data is partitioned, so a malicious document inserted into one partition has predictably limited influence on retrieval for queries that match a different partition. By configuring partitions deliberately, the system bounds how far any single poisoned document can propagate. RAGMask takes a different angle: it masks tokens in candidate documents and watches for abnormal shifts in query-document similarity. Genuine documents are robust to token masking, and their similarity scores degrade smoothly as tokens are removed. Poisoned documents that rely on specific trigger tokens show a sudden similarity collapse when those tokens are masked, which serves as a detection signal.
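The masking signal can be illustrated with a toy sketch. This is not the RAGMask implementation: it assumes a bag-of-words cosine score (`toy_similarity`) as a stand-in for a dense retriever, masks every occurrence of one distinct token at a time, and uses a hypothetical threshold of 0.5 on the largest relative similarity drop. All function names, the `[MASK]` placeholder, and the threshold are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import List

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the source


def toy_similarity(query: str, doc_tokens: List[str]) -> float:
    """Cosine similarity over token counts (assumed stand-in for a dense retriever score)."""
    q = Counter(query.lower().split())
    d = Counter(tok.lower() for tok in doc_tokens)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def max_masking_drop(query: str, document: str) -> float:
    """Largest relative similarity drop caused by masking any one distinct token."""
    tokens = document.lower().split()
    base = toy_similarity(query, tokens)
    if base == 0.0:
        return 0.0  # nothing to lose: the document does not match the query at all
    worst = 0.0
    for target in set(tokens):
        # Mask every occurrence of this token, leaving the rest of the document intact.
        masked = [MASK if tok == target else tok for tok in tokens]
        drop = (base - toy_similarity(query, masked)) / base
        worst = max(worst, drop)
    return worst


def looks_poisoned(query: str, document: str, threshold: float = 0.5) -> bool:
    """Flag documents whose score collapses when a single token is masked.

    The threshold is an illustrative assumption, not a tuned value.
    """
    return max_masking_drop(query, document) > threshold


if __name__ == "__main__":
    query = "how to reset portal password"
    genuine = "open settings to reset your portal password"
    stuffed = "password password password password click here for a prize"
    for doc in (genuine, stuffed):
        print(round(max_masking_drop(query, doc), 3), looks_poisoned(query, doc))
```

In this toy setup the genuine answer spreads its match over four query tokens, so masking any one of them costs about a quarter of its score. The keyword-stuffed document depends entirely on one repeated token and loses all of its similarity when that token is masked. A real deployment would replace `toy_similarity` with the retriever's own scoring function and would need a threshold calibrated on clean documents.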
The architectural significance is that defense need not be coupled to training. Both methods sit at the retrieval layer and treat the generator as a black box that must be protected from upstream corruption. This separation matters operationally because retrieval corpora update faster than retrievers can be retrained, so defenses that require retraining always lag behind the threat. The threat is real and severe: the related note "How vulnerable is GraphRAG to tiny text manipulations?" shows that even minimal corpus modifications can devastate accuracy in graph-structured RAG.
Inquiring lines that use this note as a source (38)
This note is a source for these synthesized inquiries.
- Why does even 0.1 percent poisoned training data persist through alignment?
- How severely do minimal corpus modifications damage RAG accuracy in practice?
- Why does bidirectional RAG amplify the risk of corpus poisoning attacks?
- What makes dense retrievers vulnerable to partition-based poisoning exploitation?
- How do token-masking patterns distinguish genuine documents from poisoned ones?
- How do entailment checks prevent synthetic data from degrading retrieval corpora?
- How do retrieval failures enable generation of fabricated scholarly constructs?
- How do access controls and anonymization fit into RAG retrieval pipelines?
- How can RAG systems integrate with existing enterprise authentication and security protocols?
- Can the serving loop itself become the primary training data source?
- Why do small training data contaminations persist through alignment for most attack types?
- How does graph structure amplify poisoning compared to flat document retrieval?
- What makes prerequisite filtering more reliable than semantic similarity matching?
- Can knowledge poisoning attacks succeed with less than 0.05 percent modified text?
- How does keyword priming enable language models to spread poisoned information?
- Can consistency training defend against adversarial text injection attacks?
- Can factually wrong generated documents still improve retrieval accuracy?
- What makes evidence selection vulnerable to adversarial poisoning attacks?
- What makes semantic attacks harder to defend against than algorithmic ones?
- How do trained weights differ from a stored library or text?
- Can RAG systems game user preferences by adding irrelevant citations?
- Why do RAG systems fail when demo queries work correctly?
- Can ecosystem-level standards reduce trap detection burden?
- How does semantic framing differ from content injection attacks?
- Does pretraining poisoning at scale persist through instruction alignment?
- Can false positives from input filtering be reduced without sacrificing defense?
- What detection mechanisms work best for corruption-style document errors?
- How does MaxSim reranking differ from structural verification at the token level?
- How do workflow-inspecting defenses fail when contamination enters at planning time?
- What makes timestamped knowledge repositories better than static memory?
- Does bounding textual edits prevent skill degradation better than free rewriting?
- What five requirements do enterprise RAG systems need beyond accuracy?
- Can existing web security defenses protect agents from content manipulation?
- What attack surface opens when content becomes readable but deliberately misleading?
- Why do standard safety filters miss advertisement embedding attacks?
- How do backdoored open-source checkpoints enable covert advertising at scale?
- What economic incentives make advertisement embedding attacks persistently viable?
- Does retrieval quality depend more on access structure or write gating?
Related concepts in this collection (3)
- How vulnerable is GraphRAG to tiny text manipulations?
  GraphRAG converts raw text into knowledge graphs for question answering. This explores whether adversaries can degrade accuracy with minimal edits to source documents, and what makes the system susceptible.
  extends: documents the threat severity that motivates these defenses; together they form an attack/defense pair on the same retrieval corpus surface
- Can RAG systems safely learn from their own generated answers?
  Explores whether retrieval-augmented generation can feed its outputs back into the corpus without corrupting knowledge with hallucinations. The core problem: how to prevent feedback loops from compounding errors.
  extends: bidirectional RAG opens a write surface that magnifies the poisoning attack vector; partition-aware retrieval and token-masking detection are exactly the kind of upstream defenses such systems will need
- Can one compromised agent corrupt an entire multi-agent network?
  Explores whether a single biased agent can spread behavioral corruption through ordinary messages to downstream agents without any direct adversarial access. Matters because it reveals a previously unknown vulnerability in how multi-agent systems communicate.
  extends: same lesson: defense at the message/retrieval layer beats trying to harden the generator; both attacks slip through ordinary content channels
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains
- Retrieval-augmented reasoning with lean language models
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
- A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
- Searching for Best Practices in Retrieval-Augmented Generation
- Chain-of-Retrieval Augmented Generation
- RAG Does Not Work for Enterprises
Original note title
RAG corpus poisoning has lightweight defenses without retraining — partition-aware retrieval and token-masking similarity shifts catch attacks the generator never sees