SYNTHESIS NOTE

Can small models learn to ground answers in context?

Does model size determine whether a system can cite evidence, refuse to answer, and reason over passages jointly? Or can training data alone teach these behaviors at any scale?

Synthesis note · 2026-06-27 · sourced from Reasoning Critiques

The default story of the last few years is that capability tracks scale: bigger weights absorb more world knowledge, and knowledge is what makes a model useful. OCC-RAG inverts the premise for one important task. For context-grounded QA, parametric knowledge is not an asset but the contamination source, because a model that answers from memory can confabulate when the supplied passages are thin. The design goal therefore becomes the opposite of scale: produce a small model that reasons over the provided context and ignores what it memorized.

What matters is that three properties usually treated as separate (multi-hop reasoning over passages, literal-quote citation, and calibrated abstention) were jointly trained into a 0.6B/1.7B model via a synthetic corpus of 3M+ examples, and the result beats stronger sub-4B baselines. This reframes faithfulness as a supervision-format problem rather than a capacity problem. The curriculum teaches the model what to do when evidence is insufficient (abstain) and how to tie each claim to a literal span, which is exactly the behavior that the note "Can RAG systems refuse to answer without reliable evidence?" identifies as the load-bearing RAG primitive.
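To make the "supervision format" idea concrete, here is a minimal sketch of what a check on a single model output could look like. It is not OCC-RAG's actual data format or evaluation code. The `GroundedAnswer` record, the `ABSTAIN` marker, the `is_faithful` function, and the example passages are all hypothetical names and data invented for illustration. The sketch assumes an output is acceptable in only two cases: it abstains and cites nothing, or it answers and every cited quote appears verbatim in at least one supplied passage.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical abstention marker; the real corpus may encode this differently.
ABSTAIN = "INSUFFICIENT_EVIDENCE"


@dataclass
class GroundedAnswer:
    """Hypothetical output record: an answer plus the literal quotes backing it."""
    answer: str
    quotes: list[str] = field(default_factory=list)


def is_faithful(passages: list[str], out: GroundedAnswer) -> bool:
    """Return True if the output abstains cleanly or is fully quote-grounded."""
    if out.answer == ABSTAIN:
        # An abstention should not carry citations.
        return not out.quotes
    if not out.quotes:
        # A substantive answer with no cited span is treated as ungrounded.
        return False
    # Every quote must be non-empty and a verbatim substring of some passage.
    # Quotes may come from different passages, which covers multi-hop answers.
    return all(
        q.strip() != "" and any(q in p for p in passages)
        for q in out.quotes
    )


# Made-up passages for illustration only.
passages = [
    "The Alder Bridge opened in 1932.",
    "The Alder Bridge was designed by the Marlow firm.",
]
grounded = GroundedAnswer("1932", ["The Alder Bridge opened in 1932."])
paraphrased = GroundedAnswer("1932", ["The bridge was opened in 1932"])
abstained = GroundedAnswer(ABSTAIN)
```

Under these assumptions, `is_faithful(passages, grounded)` and `is_faithful(passages, abstained)` hold, while the paraphrased quote fails because it is not a literal span. A check like this only verifies that quotes are literal. It does not verify that the quotes actually support the answer.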

The strongest counterargument is that small models simply have less to hallucinate from, so abstention is cheap for them, and the result might not transfer to frontier models whose parametric pull is far stronger. But that is also the point: if faithfulness is a learnable format, the lever is the training data, not the parameter count, and the same curriculum could in principle be applied at any scale. The citation behavior also carries a risk worth flagging. As the note "Do users trust citations more when there are simply more of them?" explores, citation quantity alone may influence user trust, so literal-quote citations can manufacture trust independent of whether the grounding is actually sound.


Related concepts in this collection 3


Can RAG systems refuse to answer without reliable evidence?
Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.
exemplifies the same refuse-without-evidence primitive, here baked into a small model's training curriculum

Can models express uncertainty instead of just answering?
Most factuality work expands what models know rather than what they know they know. Can expressing calibrated uncertainty create a third path between confident errors and unhelpful abstention?
grounds OCC-RAG's calibrated abstention in a broader account of uncertainty expression

Do users trust citations more when there are simply more of them?
Explores whether citation quantity alone influences user trust in search-augmented LLM responses, independent of whether those citations actually support the claims being made.
complicates the citation feature: literal-quote grounding can inflate perceived trust independent of actual faithfulness
