INQUIRING LINE

Can you train an AI to give the same answer whether its prompt has been secretly tampered with or not?

Can consistency training defend against adversarial text injection attacks?

This explores whether 'consistency training' — teaching a model to respond the same way whether or not a prompt has been tampered with — actually holds up against adversarial text deliberately injected to derail it.

Adversarial text injection is the trick of slipping extra, often irrelevant or hostile, text into a prompt to throw a model off. The corpus has a direct answer to whether consistency training can defend against it and, more usefully, a map of what the defense is up against. The core idea lives in the note "Can models learn to ignore irrelevant prompt changes?": two methods, one working at the output level (BCT) and one at the activation level (ACT), train a model to give the same answer to a clean prompt and a 'wrapped' (perturbed) one, using the model's own clean responses as the target. The clever part is that this sidesteps a problem with ordinary supervised fine-tuning, where the 'correct answers' you train on go stale as the model improves; here the model is its own teacher, so the standard never falls behind.
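The output-level idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `generate` and `wrap` are hypothetical callables standing in for the current model and the prompt perturbation, and the toy versions below exist only so the snippet runs. It shows how training pairs might be built, with the wrapped prompt as input and the model's own clean-prompt answer as target.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TrainingExample:
    prompt: str   # the wrapped (perturbed) prompt the model will be trained on
    target: str   # the model's own response to the clean prompt


def build_consistency_examples(
    clean_prompts: List[str],
    wrap: Callable[[str], str],       # hypothetical: adds the perturbation
    generate: Callable[[str], str],   # hypothetical: samples from the current model
) -> List[TrainingExample]:
    """Pair each wrapped prompt with the model's answer to its clean version."""
    examples = []
    for clean in clean_prompts:
        target = generate(clean)  # self-generated target, so it never goes stale
        examples.append(TrainingExample(prompt=wrap(clean), target=target))
    return examples


# Toy stand-ins (assumptions for illustration only).
def toy_generate(prompt: str) -> str:
    return f"answer to: {prompt}"


def toy_wrap(prompt: str) -> str:
    return prompt + " Interesting fact: cats sleep most of their lives."


examples = build_consistency_examples(["What is 2 + 2?"], toy_wrap, toy_generate)
print(examples[0])
```

Because the targets are regenerated from the current model, they track its capabilities rather than a fixed dataset; the resulting pairs would then feed an ordinary supervised fine-tuning step, which is omitted here.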

What makes this matter is the attack it's built for. The note "How vulnerable are reasoning models to irrelevant text?" shows just how cheap and brutal text injection can be: appending semantically unrelated sentences to a math problem raises reasoning-model error rates by roughly 300%, and, unsettlingly, triggers discovered on a cheap model transfer to stronger ones. That's exactly the perturbation-invariance failure consistency training targets: the model should treat the injected garbage as noise and answer as if it weren't there. So the honest framing is that consistency training is a promising defense against this specific failure mode (irrelevant or wrapping text), not a universal shield.
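To make the "roughly 300%" figure concrete, here is a small sketch that reads it as a relative increase in error rate. That reading is an assumption, and the predictions below are made-up toy data, not results from the paper. `append_trigger` and the distractor sentence are likewise illustrative.

```python
from typing import List


def append_trigger(problem: str, trigger: str) -> str:
    """Query-agnostic injection: the same unrelated sentence is appended to any problem."""
    return f"{problem} {trigger}"


def error_rate(predictions: List[str], gold: List[str]) -> float:
    """Fraction of predictions that differ from the gold answers."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal length")
    wrong = sum(p != g for p, g in zip(predictions, gold))
    return wrong / len(gold)


def relative_increase(clean_rate: float, attacked_rate: float) -> float:
    """Relative change in error rate; 3.0 means a 300% increase."""
    if clean_rate <= 0:
        raise ValueError("clean error rate must be positive")
    return (attacked_rate - clean_rate) / clean_rate


attacked_prompt = append_trigger(
    "What is 2 + 2?", "Interesting fact: cats sleep most of their lives."
)

# Made-up toy outcomes: 1 of 20 wrong on clean prompts, 4 of 20 wrong when attacked.
gold = ["4"] * 20
clean_preds = ["4"] * 19 + ["5"]
attacked_preds = ["4"] * 16 + ["5"] * 4

clean = error_rate(clean_preds, gold)
attacked = error_rate(attacked_preds, gold)
print(f"error rate {clean:.0%} -> {attacked:.0%}, "
      f"relative increase {relative_increase(clean, attacked):.0%}")
```

Under this reading, a 300% increase means the error rate quadruples, so a model that is rarely wrong on clean prompts can still look far worse in relative terms.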

Where it gets interesting is the limits, which the corpus draws by showing attacks that live below the prompt. The note "How much poisoned training data survives safety alignment?" finds that poisoning planted during pretraining (denial-of-service, context extraction, belief manipulation) survives standard safety alignment even when only 0.1% of the data is poisoned. A defense that operates on prompt-time text can't reach a vulnerability baked into the weights. And "Why do language models ignore information in their context?" points at a deeper tension: when a model's parametric priors are strong, it can ignore its context entirely. 'Invariance' therefore cuts both ways: a model trained to be unmoved by perturbations could also be unmoved by legitimate new information.

That's why the corpus's other defenses are worth reading as siblings rather than rivals. RAG poisoning has a different answer entirely: the note "Can we defend RAG systems from corpus poisoning without retraining?" describes catching malicious documents at retrieval time with partition-aware retrieval and token masking, never touching the model. "Can RAG systems refuse to answer without reliable evidence?" takes the opposite philosophy: instead of teaching invariance, it teaches refusal, constraining the model to answer only from grounded evidence and trading coverage for integrity. Consistency training says 'ignore the noise'; grounded refusal says 'when in doubt, don't answer.' Both are valid, and they fail differently.
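The token-masking intuition can be illustrated with a toy check. This is not the RAGMask implementation: the set-overlap similarity, the leave-one-token-out masking, and the 0.5 threshold are all assumptions chosen so the example is self-contained. It flags a document whose similarity to the query collapses when a single token is removed.

```python
import math
from typing import List


def cosine_overlap(query_tokens: List[str], doc_tokens: List[str]) -> float:
    """Cosine similarity between binary bag-of-words vectors (toy retriever score)."""
    q, d = set(query_tokens), set(doc_tokens)
    if not q or not d:
        return 0.0
    return len(q & d) / math.sqrt(len(q) * len(d))


def max_relative_collapse(query_tokens: List[str], doc_tokens: List[str]) -> float:
    """Largest fractional drop in similarity caused by masking any one document token."""
    base = cosine_overlap(query_tokens, doc_tokens)
    if base == 0.0:
        return 0.0
    worst = 0.0
    for i in range(len(doc_tokens)):
        masked = doc_tokens[:i] + doc_tokens[i + 1:]  # drop token i
        drop = (base - cosine_overlap(query_tokens, masked)) / base
        worst = max(worst, drop)
    return worst


THRESHOLD = 0.5  # assumed cut-off for "abnormal" collapse

query = "capital of france".split()
benign = "paris is the capital city of france".split()
suspicious = "france xq zv kk buy pills now".split()  # relevance hangs on one token

for name, doc in [("benign", benign), ("suspicious", suspicious)]:
    collapse = max_relative_collapse(query, doc)
    print(name, round(collapse, 2), "FLAG" if collapse > THRESHOLD else "ok")
```

The design point this tries to capture is that the check runs entirely at the retrieval layer: documents are scored and filtered before generation, so the generator needs no retraining.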

The takeaway a curious reader might not expect: defending against injected text isn't one problem but a layered one. Consistency training handles perturbations that ride in on the prompt; retrieval-layer filtering handles poisoned documents; grounded refusal handles untrustworthy evidence; and none of them touch poisoning that's already in the weights. The strongest systems will likely stack these, because each defense is shaped by exactly which layer the adversary got into.

Sources (6 notes)

Can models learn to ignore irrelevant prompt changes?

Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.

How vulnerable are reasoning models to irrelevant text?

Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.

How much poisoned training data survives safety alignment?

Denial-of-service, context extraction, and belief manipulation attacks persist through standard safety alignment at 0.1% poisoning rates, while jailbreaking attacks are successfully suppressed, contradicting sleeper agent persistence hypotheses.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can we defend RAG systems from corpus poisoning without retraining?

RAGPart and RAGMask provide lightweight, retraining-free defenses that operate at the retrieval layer. RAGPart bounds poisoned-document influence via partitioned retriever learning; RAGMask flags suspicious documents through abnormal similarity collapse under token masking.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Papers this line draws on

The research behind the notes this line reads, ranked by how closely each paper relates.

Consistency Training Helps Stop Sycophancy and Jailbreaks (1.70 match)
CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning (1.63 match)
Natural Emergent Misalignment From Reward Hacking In Production RL (1.62 match)
Searching for Best Practices in Retrieval-Augmented Generation (1.61 match)
A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning (1.60 match)
Spurious Forgetting in Continual Learning of Language Models (1.59 match)
Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question **Can consistency training defend against adversarial text injection attacks?** remains open despite recent work. Treat the following findings as dated claims (2023–2025) that may have shifted.

What a curated library found — and when (findings span 2023–2025; treat as perishable):
• Consistency training (BCT/ACT methods) teaches prompt-perturbation invariance by training models on their own clean outputs, sidestepping the stale targets of standard SFT (~2025, arXiv:2510.27062).
• Query-agnostic adversarial triggers raise reasoning-model error rates by ~300% and transfer across model scales (~2025, arXiv:2503.01781).
• Pre-training poisoning at just 0.1% of data persists through post-training alignment and survives prompt-layer defenses (~2024, arXiv:2410.13722).
• Strong parametric priors cause models to ignore context entirely, making 'invariance' a double-edged sword (~2025, arXiv:2504.09522).
• RAG-layer and grounded-refusal defenses operate orthogonally to consistency training, each failing differently (~2025, arXiv:2505.16014, arXiv:2506.08952).

Anchor papers (verify; mind their dates):
• arXiv:2510.27062 (2025-10) — Consistency Training Helps Stop Sycophancy and Jailbreaks
• arXiv:2503.01781 (2025-03) — Query-Agnostic Adversarial Triggers for Reasoning Models
• arXiv:2410.13722 (2024-10) — Persistent Pre-Training Poisoning of LLMs
• arXiv:2504.09522 (2025-04) — How new data permeates LLM knowledge

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, o3, Claude 4), scaling laws, composition (multi-agent + memory), or orchestration (SDKs, cached reasoning) have since relaxed or overturned it. Separate the durable question (invariance to prompt noise: still open?) from perishable limits (e.g., does o1's internal reasoning bypass text-injection noise entirely?). Cite what relaxed or confirmed each claim.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Has anyone shown consistency training *fails* at scale, or that a different training regime (e.g., RL-based grounding, as in arXiv:2508.06165 or arXiv:2511.18659) outperforms it?
(3) **Propose 2 research questions that assume the regime may have moved:** e.g., "Does consistency training defend against injection attacks in chain-of-thought reasoning models trained on synthetic data?"; "Can layered defenses (consistency + RAG filtering + grounded refusal) be unified under a single reward signal?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
