INQUIRING LINE

What detection rate is needed to make evidence-injection attacks impractical at scale?

This explores whether there's a 'good enough' detection rate that defeats attacks where false evidence is dropped into a model's context (RAG documents, agent messages, web content) — and the corpus answer is that the attack economics, not a detection percentage, are the real lever.


This question asks for a number: what catch rate makes evidence-injection too costly to bother with at scale? The corpus doesn't hand back a threshold, and the more interesting finding is why: the structural balance favors the attacker so heavily that no realistic detection rate flips the economics on its own. The clearest statement of this is the analysis of agent-trap detection, which identifies three compounding problems: detection has to run at web scale with both speed and semantic depth, the harm shows up only later (making attribution hard), and the offense-defense balance structurally favors attackers, forcing defenders into continuous adaptation (see 'What makes detecting AI agent traps fundamentally difficult?'). When defense has to be cheap per item and attack only has to succeed occasionally, 'detection rate' is the wrong axis.
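The arithmetic behind "attack only has to succeed occasionally" can be sketched in a few lines. This is a back-of-the-envelope illustration, not a result from the corpus: it assumes every injection attempt is caught independently with the same probability, which an adaptive attacker would only make worse for the defender. The function name `p_any_evades` and the sample rates are made up for the example.

```python
def p_any_evades(detection_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` injections slips past the detector.

    Assumption (not from the source): attempts are independent and each is
    caught with the same probability `detection_rate`.
    """
    if not 0.0 <= detection_rate <= 1.0:
        raise ValueError("detection_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts are caught with probability detection_rate ** attempts,
    # so at least one evades with the complement of that.
    return 1.0 - detection_rate ** attempts


if __name__ == "__main__":
    for rate in (0.95, 0.99, 0.999):
        for attempts in (10, 100, 1000):
            p = p_any_evades(rate, attempts)
            print(f"detection={rate:.3f} attempts={attempts:>4} -> P(at least one evades)={p:.3f}")
```

Under these assumptions, a 99% detector facing 100 attempts still lets at least one through more often than not, which is why the per-item catch rate says little about whether the attack is practical.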

Part of why evidence-injection resists detection is that it doesn't look like an attack. The GHOSTWRITER work shows commercial LLMs swallow fabricated claims when they're wrapped in credibility markers and slipped into context rather than explicitly requested, because alignment was trained to refuse harmful *instructions*, not to vet the epistemic quality of *context* (see 'Can language models detect fabricated evidence injected as context?'). The same blind spot powers advertisement-embedding attacks, which hide promotional or malicious content inside fluent, accurate-looking output and stay invisible to quality metrics that only check correctness (see 'Can language models be hijacked to hide covert advertising?'). A detector tuned for 'unusual patterns' misses all of this: the persuasion-taxonomy result hit over 92% jailbreak success on GPT-3.5, GPT-4, and Llama-2 precisely because defenses screen for anomalies, not for fluent, well-formed persuasion (see 'Can social science persuasion techniques jailbreak frontier AI models?').

The scaling math is brutal in the attacker's favor. Poisoning just 0.1% of pretraining data is enough for denial-of-service, context-extraction, and belief-manipulation attacks to survive standard safety alignment (see 'How much poisoned training data survives safety alignment?'). In multi-agent systems it's worse: a single compromised agent propagates corrupted behavior through six downstream agents using ordinary messages with no explicit semantic payload, evading both detection and paraphrasing defenses (see 'Can one compromised agent corrupt an entire multi-agent network?'). And framing matters more than content: FLOWSTEER shows malicious signals travel much farther when injected at high-influence positions and dressed up as *evidence* rather than instruction, which is exactly the move that slips past instruction-focused guards (see 'How does workflow position shape attack propagation in multi-agent systems?'). If one seed amplifies across a whole network, your detector needs near-perfect recall on the seed, not on the visible downstream noise.
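To see why injection position matters so much, it helps to count how many agents sit downstream of each node in a workflow. The sketch below is only an illustration: the workflow graph, the agent names, and the "reach" measure (number of reachable downstream agents) are all hypothetical and are not FLOWSTEER's own metric.

```python
from collections import deque


def downstream_reach(graph: dict[str, list[str]], start: str) -> set[str]:
    """Return every agent reachable from `start` by following message edges."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)  # a cycle back to the start does not count as downstream
    return seen


# Hypothetical workflow: each key sends its output to the listed agents.
workflow = {
    "planner": ["search", "summarize"],
    "search": ["summarize"],
    "summarize": ["draft"],
    "draft": ["review"],
    "review": [],
    "formatter": [],
}

reach = {agent: len(downstream_reach(workflow, agent)) for agent in workflow}

if __name__ == "__main__":
    for agent, count in sorted(reach.items(), key=lambda kv: -kv[1]):
        print(f"{agent:<10} reaches {count} downstream agent(s)")
```

In this toy graph a missed injection at the planner touches four other agents, while one at the formatter touches none, so a uniform detection rate over all messages hides where the real exposure sits.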

So the corpus quietly rewrites the question. Instead of chasing a magic detection percentage, the defenses that actually change the cost curve work at the architectural layer to *bound influence* rather than *catch instances*: partition-aware retrieval that caps how much any single poisoned document can sway an answer, plus token-masking that flags documents whose similarity collapses abnormally, both lightweight and requiring no retraining (see 'Can we defend RAG systems from corpus poisoning without retraining?'). The lesson worth taking away: making evidence-injection impractical isn't about detecting 95% or 99% of attempts, it's about designing systems where a missed attempt can't propagate or dominate. Against an adaptive attacker at web scale, the residual few percent you miss is all they ever needed.
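The idea of bounding influence rather than catching instances can be illustrated with a generic partition-and-vote scheme. To be clear about the assumptions: this is not RAGPart's method (the source describes that as partitioned retriever learning), and the function names, the `toy_reader`, and the `[POISON]` marker are invented for the example. The toy reader models a worst case in which one poisoned document fully hijacks whatever group it lands in.

```python
from collections import Counter
from typing import Callable, Sequence


def partition(docs: Sequence[str], k: int) -> list[list[str]]:
    """Split documents round-robin into k disjoint groups."""
    return [list(docs[i::k]) for i in range(k)]


def bounded_answer(
    docs: Sequence[str],
    k: int,
    answer_from: Callable[[Sequence[str]], str],
) -> str:
    """Answer from each group separately, then take the majority vote.

    Because the groups are disjoint, a single document can change at most
    one of the k votes.
    """
    if not docs or k < 1:
        raise ValueError("need at least one document and k >= 1")
    votes = Counter(answer_from(group) for group in partition(docs, k) if group)
    return votes.most_common(1)[0][0]


def toy_reader(group: Sequence[str]) -> str:
    """Hypothetical stand-in for a model: any poisoned doc hijacks the group."""
    return "attacker answer" if any("[POISON]" in d for d in group) else "correct answer"


docs = [f"clean document {i}" for i in range(8)] + ["[POISON] fabricated evidence"]

if __name__ == "__main__":
    print("no partitioning:", bounded_answer(docs, 1, toy_reader))
    print("three partitions:", bounded_answer(docs, 3, toy_reader))
```

With everything in one group the poisoned document wins outright; with three disjoint groups it controls one vote out of three and loses. The detector never had to recognize the document as malicious for the outcome to change.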


Sources (8 notes)

What makes detecting AI agent traps fundamentally difficult?

Research identifies three compounding challenges: web-scale detection requires both speed and semantic depth; effects are delayed, which makes forensic attribution difficult; and the offense-defense balance favors attackers, forcing continuous adaptation.

Can language models detect fabricated evidence injected as context?

The GHOSTWRITER attack demonstrates that commercial LLMs absorb false claims when they are repackaged with credibility markers and dropped into conversational context rather than explicitly requested. Safety policies trained to refuse harmful instructions leave the context channel largely unguarded, making this a scalable attack surface.

Can language models be hijacked to hide covert advertising?

Research identifies a new attack class, the advertisement-embedding attack (AEA), that plants promotional or malicious content in LLM outputs via hijacked third-party platforms or backdoored checkpoints. Unlike accuracy-focused attacks, AEA exploits the model's fluency to hide the insertion, making it invisible to standard quality metrics.

Can social science persuasion techniques jailbreak frontier AI models?

A 40-technique taxonomy of psychology-based persuasion strategies (PAP) achieved over 92% attack success on GPT-3.5, GPT-4, and Llama-2 in 10 trials. Current defenses miss semantic content attacks because they screen for unusual patterns, not fluent persuasion.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Can one compromised agent corrupt an entire multi-agent network?

Research demonstrates that a single biased agent can transmit persistent behavioral corruption through six downstream agents in chain and bidirectional topologies using only normal inter-agent communication. The bias evades detection and paraphrasing defenses because it carries no explicit semantic content.

How does workflow position shape attack propagation in multi-agent systems?

FLOWSTEER demonstrates that malicious signals propagate farther when injected into high-influence subtasks, and that framing them as evidence rather than instruction causes downstream agents to relay them. Influence concentrates where dependencies converge, making position-aware attacks far more effective.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.
