Can smaller models handle RAG filtering while larger models focus on synthesis?
Does splitting the work of a RAG pipeline between cheap small models and expensive large models improve both cost and quality? The underlying question is whether different pipeline stages have different optimal model sizes.
HiFi-RAG separates the RAG pipeline into stages handled by models of different capability and cost: a fast, cheap model (Gemini 2.5 Flash) reformulates the query, prunes irrelevant retrieved passages, and attaches citations, while the large, expensive model (Gemini 2.5 Pro) is invoked only at the final generation step. This is a tiering pattern with a specific theoretical justification: filtering and citation are pattern-matching tasks for which the smaller model is sufficient, while final synthesis is where the large model's reasoning matters most.
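The staging described above can be sketched in a few lines. This is a minimal illustration and not the HiFi-RAG implementation: the `Model` type, the `Passage` class, the `tiered_rag` function, and the prompt strings are all hypothetical, and the placement of the citation step before generation is an assumption of this sketch. A real deployment would wrap the actual small and large model clients as plain text-in, text-out callables.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: a "model" is any function from prompt text to
# response text. Real API clients would be wrapped to fit this shape.
Model = Callable[[str], str]


@dataclass
class Passage:
    doc_id: str
    text: str


def tiered_rag(question: str,
               retrieve: Callable[[str], List[Passage]],
               small: Model,
               large: Model) -> str:
    # Small model: reformulate the user question into a search query.
    query = small(f"REWRITE: {question}").strip()
    candidates = retrieve(query)

    # Small model: keep only the passages it judges relevant.
    kept = [
        p for p in candidates
        if small(f"RELEVANT? question={question} passage={p.text}")
        .strip().lower().startswith("yes")
    ]

    # Small model: pick the span to cite from each surviving passage
    # (assumed here to happen before generation).
    cited = [
        f"[{p.doc_id}] "
        + small(f"SPAN: question={question} passage={p.text}").strip()
        for p in kept
    ]

    # Large model: called exactly once, on the pruned and cited context.
    context = "\n".join(cited)
    return large(f"Context:\n{context}\n\nQuestion: {question}")
```

The point of the structure is that the number of small-model calls grows with the number of retrieved passages, while the large model is called once per question regardless of how much was retrieved.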
The design implies a richer view of "RAG" than a single retrieve-then-generate pass. Each intermediate decision (which query to expand, which passages to keep, which spans to cite) has its own optimal cost-quality point, and forcing the most capable model to make all of them wastes compute on tasks where it offers no marginal benefit. The hierarchy also produces a useful side effect: because filtering happens before generation, the large model receives a smaller, higher-quality context, which improves its answer quality even setting cost aside.
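A small worked calculation makes the cost side of this argument concrete. Every number below is a hypothetical placeholder: the per-token prices, the 10x gap between tiers, and the per-stage token counts are assumptions chosen for illustration, not measured or quoted values. The sketch compares running every stage on the large tier against assigning the three pattern-matching stages to the small tier.

```python
# Hypothetical prices in arbitrary cost units per token. The 10x gap between
# tiers is an illustrative assumption, not a quoted rate.
PRICE = {"small": 1, "large": 10}

# Hypothetical token counts consumed by each stage for one query.
STAGE_TOKENS = {
    "reformulate": 200,
    "filter": 8000,    # reads every retrieved passage
    "cite": 2000,      # reads only the passages that survived filtering
    "generate": 2000,  # reads the pruned context
}


def pipeline_cost(assignment):
    """Total cost when each stage runs on the tier named in `assignment`."""
    return sum(STAGE_TOKENS[stage] * PRICE[tier]
               for stage, tier in assignment.items())


# Uniform deployment: the large model handles every stage.
uniform = {stage: "large" for stage in STAGE_TOKENS}

# Tiered deployment: only final generation stays on the large model.
tiered = {**uniform, "reformulate": "small", "filter": "small", "cite": "small"}

uniform_cost = pipeline_cost(uniform)
tiered_cost = pipeline_cost(tiered)
print(f"uniform={uniform_cost} tiered={tiered_cost}")
```

Under these assumed numbers the filter stage dominates token volume, so moving it to the cheaper tier accounts for most of the savings. The calculation covers cost only; the quality benefit of a pruned context is a separate claim that this arithmetic does not test.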
The general principle is that RAG architectures should be designed around decision granularity rather than uniform model deployment. The retrieval pipeline contains several distinct sub-decisions, and matching each to an appropriately sized model produces answers that are both cheaper and better, a Pareto improvement that uniform RAG misses because it treats retrieval and generation as a single coupled act. This is the RAG-specific instance of the argument in "Can small language models handle most agent tasks?": heterogeneous tiered architectures are the economic imperative whenever subtasks have different capability requirements.
Inquiring lines that use this note as a source
This note is a source for these synthesized inquiries.
- Does the optimal model size depend on what capabilities you actually need?
- How should compute budgets be allocated across multi-stage RAG architectures?
- How do routers decide when to escalate from small to large models?
- Can multiple small models outperform a single large model with good routing?
- What happens when you project the same model onto different harnesses?
- How can expensive models efficiently support cheap models in production?
Related concepts in this collection
- Can small language models handle most agent tasks?
  Explores whether smaller, cheaper models are actually sufficient for the repetitive, scoped work that dominates deployed agent systems, rather than relying on large models by default.
  extends: same heterogeneous-architecture economic argument applied to RAG sub-decisions: filtering and citation are the SLM-suitable subtasks; final synthesis is the LLM-required subtask
- Do hierarchical retrieval architectures outperform flat ones on complex queries?
  Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
  extends: structurally analogous; HierSearch separates planning/synthesis at the system level; HiFi-RAG separates filtering/synthesis at the model-tier level
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  extends: HiFi-RAG inverts the usual move: instead of more compute on a smaller model, it allocates the larger model only to the genuinely hard step; both treat compute allocation as the lever rather than uniform scaling
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  extends: same adaptive allocation principle, applied to model selection across pipeline stages rather than to compute per query
Related papers in this collection
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Revisiting RAG Ensemble: A Theoretical and Mechanistic Analysis of Multi-RAG System Collaboration
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
- MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries
- Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing
- A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
- Chain-of-Retrieval Augmented Generation
Original note title
hierarchical RAG splits filtering from generation across model tiers — small models prune and cite while large models only synthesize the final answer