Can RAG systems refuse to answer without reliable evidence?
Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.
A hybrid multilingual RAG system for question answering over noisy historical newspapers handles two kinds of corruption that modern RAG benchmarks largely ignore: OCR errors that scramble the surface text, and language drift, where vocabulary and orthography shift across centuries within the same corpus. It defends against both structurally rather than by denoising the text. The pipeline uses semantic query expansion to widen what counts as a match, multi-query retrieval with Reciprocal Rank Fusion to consolidate evidence across query variants, and, most importantly, a grounded generation prompt that produces an answer only when evidence is actually retrieved.
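The fusion step can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion in general, not the system's own code: each document scores the sum of 1 / (k + rank) over the query variants that retrieved it. The constant k = 60 is a commonly used default and is an assumption here, since the note does not state the value used. The document IDs and the three rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists in which it
    appears, with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for three variants of one question.
rankings = [
    ["d1", "d2", "d3"],  # original query
    ["d2", "d4", "d1"],  # expansion with a historical spelling
    ["d2", "d3"],        # expansion with a synonym
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

A document that several variants retrieve, such as `d2` here, outranks one that a single variant places first. That is the property that helps on degraded text, where any one phrasing of the query may miss a passage because of an OCR error or an older spelling.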
The grounded-refusal step is what distinguishes this from a typical noisy-RAG approach. When sources are corrupted, the generator is tempted to fill the gaps from prior knowledge, which produces plausible-sounding but ungrounded answers. The grounded prompt makes refusal the default when retrieval fails, which preserves the integrity of the answer at the cost of coverage. Combined with the semantic and multi-query expansion that improves recall on degraded text, the system trades hallucination for honest "I cannot find this" responses. The cost of this trade is real, and the refusal posture is hard to maintain: the related note "Does reasoning fine-tuning make models worse at declining to answer?" shows that recent training trends actively work against it.
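One way to make refusal the default is to gate generation on retrieval and instruct the generator to refuse when the passages do not contain the answer. The sketch below assumes a caller-supplied `generate` function that maps a prompt string to a completion. The prompt wording, the `REFUSAL` string, and the function names are all hypothetical and are not the system's actual prompt.

```python
REFUSAL = "I cannot find this in the retrieved sources."


def build_grounded_prompt(question, passages):
    """Return a prompt restricted to the passages, or None if there are none."""
    if not passages:
        return None
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the numbered passages below. "
        f"If they do not contain the answer, reply exactly: {REFUSAL}\n\n"
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question, passages, generate):
    """Refuse by default; call the generator only when evidence was retrieved."""
    prompt = build_grounded_prompt(question, passages)
    if prompt is None:
        return REFUSAL  # nothing retrieved, so the generator is never called
    return generate(prompt)


# Stand-in generator for illustration only; a real one would call a model.
def echo_generator(prompt):
    return "stub answer"


print(answer("Who edited the paper in 1848?", [], echo_generator))
```

The gate covers the case where retrieval returns nothing. The instruction inside the prompt covers the case where passages come back but do not contain the answer, and that second case depends on the model following the instruction.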
The general principle is that corruption-tolerant RAG should expand retrieval aggressively while constraining generation conservatively: push recall up, but generate only when grounded. This inverts the implicit policy of most RAG systems, which is to retrieve narrowly and generate freely. For high-noise corpora the inversion is the correct trade.
Inquiring lines that use this note as a source 70
This note is a source for these synthesized inquiries.
- Can statistical filtering plus narrative generation fool academic peer review?
- How do archive systems handle knowledge that changes with each generation?
- Can citation practices work when AI cannot produce traceable sources?
- What verification methods work for knowledge without stable referents?
- Can beam search and ranking functions evaluate claims without understanding counterarguments?
- When do readers defer to AI text without genuine processing?
- How severely do minimal corpus modifications damage RAG accuracy in practice?
- Why does bidirectional RAG amplify the risk of corpus poisoning attacks?
- How do token-masking patterns distinguish genuine documents from poisoned ones?
- Why does retrieval quality sometimes conflict with final answer quality?
- Can precision and recall metrics work without a ground truth?
- How do entailment checks prevent synthetic data from degrading retrieval corpora?
- How do byte-level representations enable better handling of typos than tokens?
- How does era sensitivity in legal cases compound with context length failures?
- How do retrieval failures enable generation of fabricated scholarly constructs?
- Can verification mechanisms prevent AI agents from inventing false citations?
- Should retrieval be triggered always or only for difficult questions?
- How does retrieval-augmented generation extract structured properties from domain descriptions?
- Can verifier-guided search catch factual errors that reasoning training cannot?
- What causes the retrieval-augmented generation to fail in practice?
- How do access controls and anonymization fit into RAG retrieval pipelines?
- What techniques enable RAG systems to handle heterogeneous data formats at scale?
- What makes retrieval augmentation more effective than simply increasing embedding size?
- What hidden costs might fine-tuning retrieval models introduce on out-of-distribution queries?
- Can semantic query expansion overcome vocabulary mismatch in corrupted text?
- How do external safeguards like retrieval augmentation prevent hallucination?
- Could eliminating retrieval entirely work better than shifting the burden?
- Does filtering passages before generation improve large model answer quality?
- What documents improve answers beyond surface query similarity?
- How does retrieval-augmented generation create topically redundant content patterns?
- How do personalization errors differ from general accuracy problems in summaries?
- How should systems reject queries outside their trained domain?
- Can we verify fabricated text without redesigning the generation process?
- Can domain pretraining on historical legal corpora reduce era sensitivity?
- Could real-time search systems avoid era sensitivity in legal reasoning?
- What causes autoregressive generation to fail on out-of-corpus item identifiers?
- What happens when prompt-optimized results lack anchoring in real data?
- Can consistency training defend against adversarial text injection attacks?
- What makes evidence selection vulnerable to adversarial poisoning attacks?
- Can adaptive elbow detection replace fixed top-k limits in evidence retrieval?
- Can retrieval strategies drive both draft refinement and new research question generation?
- Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?
- How does the rate of generation outpace archival of outputs?
- Why does search-augmented generation still not solve the verification problem?
- Why does the generation-verification gap disappear for factual recall tasks?
- Can marking AI provenance solve the grounding problem for generated text?
- Why does probability of text completion not equal knowledge value?
- Can selective history filtering address topic drift that generation-time topic following cannot prevent?
- Can RAG systems game user preferences by adding irrelevant citations?
- Why do RAG systems fail when demo queries work correctly?
- What governance safeguards could constrain misuse of demographic inference?
- What replaces text-based expertise when surface markers become unreliable?
- What makes provenance infrastructure more critical than artifact quality?
- How does generation-verification asymmetry create the need for verifiable reporting?
- Can false positives from input filtering be reduced without sacrificing defense?
- What detection mechanisms work best for corruption-style document errors?
- Why do frontier model failures in document editing go undetected by users?
- What makes legal and medical queries particularly vulnerable to structural near-misses?
- Does uncertainty trigger retrieval better than fixed-interval tool calls?
- How should retrieval systems decide when to fetch new information?
- What role does document reranking play alongside decisions about whether to retrieve?
- Why do retrieval-augmented generation systems fail to detect knowledge conflicts?
- What five requirements do enterprise RAG systems need beyond accuracy?
- Can adaptive retrieval triggered by model uncertainty improve RAG reliability?
- Can learned verifiers detect structural near-misses that pooled retrievers miss?
- What safeguards prevent AI from generating fake papers with fabricated citations?
- Do fluent generated summaries carry false authority over expert judgment?
- Are uncertainty estimation and external feature signals complementary for retrieval?
- Does retrieval augmented generation actually eliminate hallucinations in any domain?
- Why does production retrieval augmented generation underperform in real deployments?
Related concepts in this collection 4
- Does reasoning fine-tuning make models worse at declining to answer?
  When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
  extends: documents the model-side obstacle to grounded refusal; recent fine-tuning regimes actively suppress the abstention capacity this RAG primitive depends on
- Does training objective determine which direction models fail at abstention?
  Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
  extends: explains why the grounded-refusal prompt has to be explicit; without it, the underlying model's training objective biases generation away from "I don't know"
- Can any computable LLM truly avoid hallucinating?
  Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.
  supports: gives the formal reason grounded-refusal is the right RAG primitive for noisy corpora; confabulation cannot be eliminated at the model level, only mediated by retrieval-time policy
- Why do queries and documents occupy different embedding spaces?
  Queries and documents express the same information in fundamentally different ways: short and interrogative versus long and declarative. Understanding this mismatch is crucial for why direct embedding retrieval often fails.
  extends: same retrieval-side widening move (semantic query expansion ≈ HyDE) but coupled with grounded refusal rather than open generation
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- Searching for Best Practices in Retrieval-Augmented Generation
- DRAGIN: Dynamic Retrieval Augmented Generation based on the Information Needs of Large Language Models
- UR2: Unify RAG and Reasoning through Reinforcement Learning
- A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
- Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
- LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
- RAG Does Not Work for Enterprises
Original note title
grounded generation that refuses to answer without evidence is the noise-tolerant RAG primitive — OCR errors and language drift do not justify confabulation