SYNTHESIS NOTE

Can RAG systems refuse to answer without reliable evidence?

Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.

Synthesis note · 2026-05-03

A hybrid multilingual RAG system for question answering over noisy historical newspapers handles two kinds of corruption that modern RAG benchmarks largely ignore: OCR errors that scramble surface text, and language drift, where vocabulary and orthography shift across centuries within the same corpus. Its defense against both is structural rather than based on denoising. The pipeline uses semantic query expansion to widen what counts as a match, multi-query retrieval with Reciprocal Rank Fusion to consolidate evidence across query variants, and, most importantly, a grounded generation prompt that produces an answer only when supporting evidence is actually retrieved.
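
The fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not the system's actual code: the function name, the document IDs, and the constant `k = 60` (a commonly used value, not one stated in this note) are assumptions for illustration. Each document scores the sum of `1 / (k + rank)` over every query variant's ranking in which it appears.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    rankings: list of lists, each ordered best-first (one per query variant).
    k: smoothing constant; 60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # A document gains more from a high rank in each list it appears in.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from three variants of one question
# (for example: original wording, modernized spelling, synonym expansion).
rankings = [
    ["d3", "d1", "d7"],
    ["d1", "d9", "d3"],
    ["d1", "d3", "d4"],
]
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

Documents that several variants agree on rise to the top even when no single variant ranks them first, which is the property that helps when OCR noise makes any one query formulation unreliable.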

The grounded-refusal step is what distinguishes this from a typical noisy-RAG approach. When sources are corrupted, the temptation is for the generator to fill in the gaps from prior knowledge, which produces plausible-sounding but ungrounded answers. The grounded prompt makes refusal the default when retrieval fails, which preserves the integrity of the answer at the cost of coverage. Combined with the semantic and multi-query expansion that improves recall on degraded text, the system trades hallucination for honest "I cannot find this" responses. The cost of this trade is real, and so is the difficulty of holding to it: the related note "Does reasoning fine-tuning make models worse at declining to answer?" shows that recent training trends actively work against this kind of refusal posture.
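
A refusal-by-default gate might look like the sketch below. The note does not give the system's actual prompt, so the prompt wording, the `REFUSAL` string, and the `min_passages` parameter are all hypothetical. The sketch shows two layers: refuse outright when retrieval returns nothing usable, and otherwise instruct the generator to answer only from the retrieved passages.

```python
# Hypothetical refusal message; the real system's wording is not given in the note.
REFUSAL = "I cannot find this in the retrieved sources."


def build_grounded_prompt(question, passages, min_passages=1):
    """Return (prompt, refusal); exactly one of the two is None.

    If fewer than min_passages non-empty passages were retrieved, no prompt
    is built and the caller should return the refusal without calling a model.
    """
    usable = [p.strip() for p in passages if p and p.strip()]
    if len(usable) < min_passages:
        return None, REFUSAL

    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(usable, start=1))
    prompt = (
        "Answer the question using ONLY the numbered passages below.\n"
        f'If the passages do not contain the answer, reply exactly: "{REFUSAL}"\n\n'
        f"Passages:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return prompt, None


# No evidence retrieved: refuse without generating.
no_prompt, refusal = build_grounded_prompt("Who edited the paper in 1848?", [])

# Evidence retrieved: build a prompt that still permits refusal.
prompt, no_refusal = build_grounded_prompt(
    "Who edited the paper in 1848?",
    ["  ", "The editor in 1848 is named in the masthead."],
)
```

Checking for empty retrieval before the model is called means the refusal in that case does not depend on the generator following instructions, which matters if the underlying model is biased against abstaining.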

The general principle is that corruption-tolerant RAG should expand retrieval aggressively while constraining generation conservatively — recall up, but only generate when grounded. This inverts the implicit policy of most RAG systems, which is to retrieve narrowly and generate freely. For high-noise corpora the inversion is the correct trade.

Inquiring lines that read this note 76

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do readers trust citations and complexity regardless of accuracy?

Can AI-generated outputs constitute genuine knowledge or valid claims?

How do archive systems handle knowledge that changes with each generation?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

Why does verification consistently lag behind AI generation?

Can ensemble evaluation methods reduce bias more than single judges?

Can beam search and ranking functions evaluate claims without understanding counterarguments?

Does AI text rewriting systematically distort writer intent and preference?

When do readers defer to AI text without genuine processing?

When should retrieval-augmented systems decide to fetch new information?

How do adversarial and manipulative prompts attack reasoning models?

How should retrieval systems optimize for multi-step reasoning during inference?

Why does finetuning cause catastrophic forgetting of model capabilities?

How do byte-level representations enable better handling of typos than tokens?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How does era sensitivity in legal cases compound with context length failures?

How do training data properties shape reasoning capability development?

Can verifier-guided search catch factual errors that reasoning training cannot?

Why do semantic similarity and task relevance diverge in vector embeddings?

What makes retrieval augmentation more effective than simply increasing embedding size?

How do knowledge injection methods compare across cost and effectiveness?

What hidden costs might fine-tuning retrieval models introduce on out-of-distribution queries?

Can language model hallucination be prevented or only managed?

What makes specific clarifying questions more effective than generic ones?

What structural advantages do diffusion language models offer over autoregressive methods?

How can identical external performance mask different internal representations?

What happens when prompt-optimized results lack anchoring in real data?

How should iterative research systems allocate reasoning per search step?

Can retrieval strategies drive both draft refinement and new research question generation?

Which computational strategies best support reasoning in language models?

Does RL pruning of documents differ fundamentally from rationale-driven evidence selection?

How do evaluation biases undermine LLM quality assessment systems?

Why does probability of text completion not equal knowledge value?

How do language models inherit human biases from training data?

What governance safeguards could constrain misuse of demographic inference?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

What detection mechanisms work best for corruption-style document errors?

How do evaluation mechanisms prevent error accumulation in autonomous research systems?

Why do frontier model failures in document editing go undetected by users?

Does AI fluency substitute for verifiable accuracy in human judgment?

Do fluent generated summaries carry false authority over expert judgment?

Related concepts in this collection 4

Does reasoning fine-tuning make models worse at declining to answer? When models are trained to reason better, do they lose the ability to say 'I don't know'? This matters for high-stakes applications like medical and legal AI that depend on appropriate uncertainty.
extends: documents the model-side obstacle to grounded refusal; recent fine-tuning regimes actively suppress the abstention capacity this RAG primitive depends on

Does training objective determine which direction models fail at abstention? Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them.
extends: explains why the grounded-refusal prompt has to be explicit; without it, the underlying model's training objective biases generation away from "I don't know"

Can any computable LLM truly avoid hallucinating? Explores whether formal theorems prove hallucination is mathematically inevitable for all computable language models, regardless of their design or training approach.
supports: gives the formal reason grounded refusal is the right RAG primitive for noisy corpora; confabulation cannot be eliminated at the model level, only mediated by retrieval-time policy

Why do queries and documents occupy different embedding spaces? Queries and documents express the same information in fundamentally different ways: short and interrogative versus long and declarative. Understanding this mismatch is crucial for why direct embedding retrieval often fails.
extends: same retrieval-side widening move (semantic query expansion ≈ HyDE) but coupled with grounded refusal rather than open generation
