Can small models learn to ground answers in context?
Does model size determine whether a system can cite evidence, refuse to answer, and reason over passages jointly? Or can training data alone teach these behaviors at any scale?
The default story of the last few years is that capability tracks scale: bigger weights absorb more world knowledge, and knowledge is what makes a model useful. OCC-RAG inverts the premise for one important task. For context-grounded QA, parametric knowledge is not an asset; it is the contamination source, because a model that answers from memory can confabulate when the supplied passages are thin. The design goal therefore becomes the opposite of scale: produce a small model that reasons over the provided context and ignores what it memorized.
What matters is that three properties usually treated as separate (multi-hop reasoning over passages, literal-quote citation, and calibrated abstention) were jointly trained into a 0.6B/1.7B model via a synthetic corpus of 3M+ examples, and the result beats stronger sub-4B baselines. This reframes faithfulness as a supervision-format problem rather than a capacity problem. The curriculum teaches the model what to do when evidence is insufficient (abstain) and how to tie each claim to a literal span, which is exactly the behavior that the note "Can RAG systems refuse to answer without reliable evidence?" identifies as the load-bearing RAG primitive.
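The two output-side behaviors described above, literal-span citation and abstention on insufficient evidence, can be illustrated with a small checker. This is a minimal sketch, not OCC-RAG's training procedure or output format: the `Claim` structure, the `ABSTAIN` marker, and the function names are all hypothetical, and the check only tests whether a quoted span appears verbatim in the supplied passages, not whether it actually supports the claim.

```python
from dataclasses import dataclass

ABSTAIN = "INSUFFICIENT_EVIDENCE"  # hypothetical abstention marker


@dataclass
class Claim:
    text: str   # the statement made in the answer
    quote: str  # the span the model says it copied from a passage


def normalize(s: str) -> str:
    # Collapse whitespace so line wrapping does not break a literal match.
    return " ".join(s.split())


def quote_is_literal(quote: str, passages: list[str]) -> bool:
    # A citation counts only if the non-empty quote occurs verbatim in some passage.
    q = normalize(quote)
    return bool(q) and any(q in normalize(p) for p in passages)


def grounded_answer(claims: list[Claim], passages: list[str]) -> str:
    # Abstain when there are no claims, or when any claim lacks a verbatim quote.
    if not claims or not all(quote_is_literal(c.quote, passages) for c in claims):
        return ABSTAIN
    return " ".join(f'{c.text} ["{normalize(c.quote)}"]' for c in claims)
```

A checker like this could serve as a filter on synthetic training examples or as an evaluation metric. It says nothing about whether the quoted span is relevant, which is the gap the citation-trust concern points at.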
The strongest counterargument is that small models simply have less to hallucinate from, so abstention is cheap for them, and the result might not transfer to frontier models whose parametric pull is far stronger. But that is also the point: if faithfulness is a learnable format, the lever is the training data, not the parameter count, and the same curriculum could in principle be applied at any scale. The citation behavior also carries a risk worth flagging: as the note "Do users trust citations more when there are simply more of them?" explores, citations may raise user trust whether or not they support the claim, so literal-quote citations can manufacture trust independent of whether the grounding is actually sound.
Related concepts in this collection 3
- Can RAG systems refuse to answer without reliable evidence?
  Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.
  exemplifies the same refuse-without-evidence primitive, here baked into a small model's training curriculum
- Can models express uncertainty instead of just answering?
  Most factuality work expands what models know rather than what they know they know. Can expressing calibrated uncertainty create a third path between confident errors and unhelpful abstention?
  grounds OCC-RAG's calibrated abstention in a broader account of uncertainty expression
- Do users trust citations more when there are simply more of them?
  Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.
  complicates the citation feature: literal-quote grounding can inflate perceived trust independent of actual faithfulness
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- OCC-RAG: Optimal Cognitive Core for Faithful Question Answering
- Measuring Faithfulness in Chain-of-Thought Reasoning
- VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- Reverse Thinking Makes LLMs Stronger Reasoners
- AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
- Can We Trust AI Explanations? Evidence of Systematic Underreporting in Chain-of-Thought Reasoning
Original note title
faithfulness is a training curriculum not a scale property — small models can learn context-grounding, citation, and abstention jointly