SYNTHESIS NOTE

Can smaller models handle RAG filtering while larger models focus on synthesis?

Does splitting the work of a RAG pipeline between cheaper small models and expensive large models improve both cost and quality? The question is whether different pipeline stages have different optimal model sizes.

Synthesis note · 2026-05-03
How should retrieval and reasoning integrate in RAG systems?

HiFi-RAG separates the RAG pipeline into stages handled by models of different capability and cost: a fast, cheap model (Gemini 2.5 Flash) reformulates queries, prunes irrelevant retrieved passages, and attaches citations, while the large, expensive model (Gemini 2.5 Pro) is invoked only at the final generation step. This is a tiering pattern with a specific theoretical justification: filtering and citation are pattern-matching tasks for which the smaller model is sufficient, while final synthesis is where the large model's reasoning matters most.
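The tiering described above can be sketched in a few lines. This is a minimal illustration, not HiFi-RAG's actual implementation: `tiered_rag`, `Passage`, the prompt tags, and the three stub functions are all hypothetical names introduced here, the "models" are plain Python callables rather than real Gemini clients, and citation tagging is reduced to a string operation placed before synthesis, which is an assumption about ordering.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    doc_id: str
    text: str


# Hypothetical stand-in: a "model" is any function from prompt text to completion text.
Model = Callable[[str], str]


def tiered_rag(
    question: str,
    retrieve: Callable[[str], List[Passage]],
    small: Model,
    large: Model,
) -> str:
    # Small model: reformulate the user question into a search query.
    query = small(f"REWRITE\n{question}")
    candidates = retrieve(query)

    # Small model: keep only the passages it judges relevant.
    kept = [
        p for p in candidates
        if small(f"RELEVANT?\n{question}\n{p.text}").strip().lower().startswith("yes")
    ]

    # Citation tagging: prefix each kept passage with its id so the answer can cite it.
    # (Assumption: in this sketch tagging is a string operation, not a model call.)
    context = "\n".join(f"[{p.doc_id}] {p.text}" for p in kept)

    # Large model: invoked exactly once, on the pruned and tagged context.
    return large(f"CONTEXT\n{context}\nQUESTION\n{question}")


# Toy stubs so the sketch runs without any API access.
def small_stub(prompt: str) -> str:
    task, _, body = prompt.partition("\n")
    if task == "REWRITE":
        return body.lower()
    # RELEVANT? task: body has the form "question\npassage".
    passage = body.split("\n", 1)[1]
    return "yes" if "tier" in passage.lower() else "no"


def large_stub(prompt: str) -> str:
    # Echoes its prompt so the effect of pruning is visible in the result.
    return prompt


def retrieve_stub(query: str) -> List[Passage]:
    return [
        Passage("d1", "Tiered pipelines route subtasks to different models."),
        Passage("d2", "Bananas are rich in potassium."),
    ]


answer = tiered_rag("Why tier models in RAG?", retrieve_stub, small_stub, large_stub)
```

With these stubs the large model sees only the passage tagged `[d1]`; the off-topic passage is dropped before the single expensive call. Swapping the stubs for real clients would leave the control flow unchanged.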

The design implies a richer view of "RAG" than a single retrieve-then-generate pass. Each intermediate decision (which query to expand, which passages to keep, which spans to cite) has its own optimal cost-quality point, and forcing the most capable model to make all of them wastes compute on tasks where it offers no marginal benefit. The hierarchy also has a useful side effect: because filtering happens before generation, the large model receives a smaller, higher-quality context, which improves its answer quality even setting cost aside.
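A rough back-of-the-envelope calculation shows how the cost side of this argument works. The function below is an illustrative sketch only: `input_cost` is a hypothetical helper, the prices are made-up units per 1,000 tokens rather than real Gemini pricing, and only input tokens for reading passages are counted (query rewriting and output tokens are ignored).

```python
def input_cost(
    n_passages: int,
    tokens_per_passage: int,
    keep_fraction: float,
    small_price: float,
    large_price: float,
) -> tuple:
    """Return (uniform, tiered) input cost in arbitrary units.

    Prices are per 1,000 tokens and purely hypothetical.
    """
    kilo_tokens = n_passages * tokens_per_passage / 1000
    # Uniform deployment: the large model reads every retrieved passage.
    uniform = kilo_tokens * large_price
    # Tiered deployment: the small model reads everything,
    # the large model reads only the passages that survive filtering.
    tiered = kilo_tokens * small_price + kilo_tokens * keep_fraction * large_price
    return uniform, tiered


# Assumed numbers: 20 passages of 500 tokens, 25% kept, large model 10x the small model's price.
uniform, tiered = input_cost(20, 500, 0.25, small_price=1.0, large_price=10.0)
print(uniform, tiered)  # 100.0 35.0
```

Under these assumptions, tiering is cheaper whenever `keep_fraction < 1 - small_price / large_price`; if the filter keeps nearly everything, the extra small-model pass is pure overhead. The quality benefit of a cleaner context is a separate effect that this arithmetic does not capture.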

The general principle is that RAG architectures should be designed around decision granularity rather than uniform model deployment. The retrieval pipeline contains several distinct sub-decisions, and matching each to an appropriately sized model produces answers that are both cheaper and better, a Pareto improvement that uniform RAG misses because it treats retrieval and generation as a single coupled act. This is the RAG-specific instance of the question "Can small language models handle most agent tasks?": heterogeneous tiered architectures are the economic imperative whenever subtasks have different capability requirements.


Original note title

hierarchical RAG splits filtering from generation across model tiers — small models prune and cite while large models only synthesize the final answer