Can smaller models handle RAG filtering while larger models focus on synthesis?

Does splitting RAG pipeline work between cheaper small models and expensive large models improve both cost and quality? The question asks whether different pipeline stages have different optimal model sizes.

Synthesis note · 2026-05-03

HiFi-RAG separates the RAG pipeline into stages handled by models of different capability and cost: a fast, cheap model (Gemini 2.5 Flash) reformulates the query, prunes irrelevant retrieved passages, and attaches citations, while the large, expensive model (Gemini 2.5 Pro) is invoked only at the final generation step. This tiering pattern has a specific theoretical justification: filtering and citation are pattern-matching tasks for which the smaller model is sufficient, while final synthesis is where the large model's reasoning matters most.
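
The staging described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not HiFi-RAG's actual implementation: `small_model` and `large_model` are hypothetical callables standing in for the cheap and expensive models, and the `Passage` type, the prompt strings, and the yes/no relevance protocol are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interface: prompt text in, completion text out.
Model = Callable[[str], str]


@dataclass
class Passage:
    doc_id: str
    text: str


def tiered_rag(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    small_model: Model,
    large_model: Model,
) -> str:
    # Stage 1 (small model): reformulate the question into a search query.
    query = small_model(f"Rewrite as a search query: {question}").strip()

    # Stage 2 (small model): prune passages judged irrelevant.
    kept = [
        p
        for p in retrieve(query)
        if small_model(
            f"Is this passage relevant to the question '{question}'? "
            f"Answer yes or no.\n{p.text}"
        ).strip().lower().startswith("yes")
    ]

    # Stage 3 (small model): pick a supporting span and tag it with its source.
    cited = [
        f"[{p.doc_id}] "
        + small_model(
            f"Quote the span that best supports an answer to '{question}'.\n{p.text}"
        ).strip()
        for p in kept
    ]

    # Stage 4 (large model): the only call to the expensive tier.
    context = "\n".join(cited)
    return large_model(
        f"Answer using only the cited context.\n{context}\nQuestion: {question}"
    )
```

In this sketch the large model is called exactly once per question, and only on the pruned, citation-tagged context; every other decision is delegated to the small model.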

The design implies a richer view of "RAG" than a single retrieve-then-generate pass. Each intermediate decision (which query to expand, which passages to keep, which spans to cite) has its own optimal cost-quality point, and forcing the most capable model to make all of them wastes compute on tasks where it offers no marginal benefit. The hierarchy also produces a useful side effect: because filtering happens before generation, the large model receives a smaller, higher-quality context, which improves its answer quality even setting cost aside.
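
To make the cost side of this argument concrete, the arithmetic can be sketched as below. All token counts and prices are made-up placeholders, not measured figures for any real model; only the shape of the calculation is the point.

```python
from typing import Dict

# Hypothetical token volume handled at each stage (placeholder numbers).
STAGE_TOKENS: Dict[str, int] = {
    "reformulate": 200,
    "filter": 8000,
    "cite": 2000,
    "synthesize": 2500,  # small because filtering already pruned the context
}

# Hypothetical prices per 1,000 tokens, in arbitrary currency units.
PRICE_PER_1K: Dict[str, float] = {"small": 0.1, "large": 2.0}


def pipeline_cost(assignment: Dict[str, str]) -> float:
    """Sum the cost of every stage under a stage -> tier assignment."""
    return sum(
        STAGE_TOKENS[stage] / 1000 * PRICE_PER_1K[tier]
        for stage, tier in assignment.items()
    )


# Uniform deployment: the large model handles every stage.
UNIFORM = {stage: "large" for stage in STAGE_TOKENS}

# Tiered deployment: small model everywhere except final synthesis.
TIERED = {**{stage: "small" for stage in STAGE_TOKENS}, "synthesize": "large"}

uniform_cost = pipeline_cost(UNIFORM)
tiered_cost = pipeline_cost(TIERED)
```

Under these assumed numbers most of the token volume sits in the filtering stage, so moving that stage to the cheaper tier accounts for most of the saving.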

The general principle is that RAG architectures should be designed in terms of decision granularity rather than uniform model deployment. The retrieval pipeline contains several distinct sub-decisions, and matching each to an appropriately sized model produces answers that are both cheaper and better. This is a Pareto improvement that uniform RAG misses because it treats retrieval and generation as a single coupled act. It is also the RAG-specific instance of the argument in "Can small language models handle most agent tasks?": heterogeneous tiered architectures are the economic imperative whenever subtasks have different capability requirements.
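
The "Pareto improvement" claim has a precise meaning that a short check can spell out: one configuration dominates another if it is no worse on cost and quality and strictly better on at least one. The sketch below assumes a hypothetical `Config` record with a scalar cost and a scalar quality score; how quality would actually be measured is left open.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    cost: float     # lower is better
    quality: float  # higher is better


def pareto_dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.quality > b.quality
    return no_worse and strictly_better
```

A tiered pipeline that is cheaper but lower in quality than the uniform one would not pass this check, which is why the context-quality side effect matters to the argument.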

Inquiring lines that read this note 7

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

When does architectural design matter more than raw model capacity?

How do knowledge injection methods compare across cost and effectiveness?

How should compute budgets be allocated across multi-stage RAG architectures?

Can model routing outperform monolithic scaling as an efficiency strategy?

Do harness improvements transfer across model scales or memorize shortcuts?

Related concepts in this collection 4

Can small language models handle most agent tasks? Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
extends: same heterogeneous-architecture economic argument applied to RAG sub-decisions; filtering and citation are the SLM-suitable subtasks, and final synthesis is the LLM-required subtask

Do hierarchical retrieval architectures outperform flat ones on complex queries? Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
extends: structurally analogous; HierSearch separates planning/synthesis at the system level; HiFi-RAG separates filtering/synthesis at the model-tier level

Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
extends: HiFi-RAG inverts the usual move: instead of spending more compute on a smaller model, it allocates the larger model only to the genuinely hard step; both treat compute allocation as the lever rather than uniform scaling

Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
extends: same adaptive allocation principle, applied to model selection across pipeline stages rather than to compute per query
