Can long-context models resolve retriever-reader imbalance?
Traditional RAG systems forced retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change which architecture makes sense?
Standard RAG retrieves 100-word paragraphs. This forces the retriever to locate the precise passage containing the answer across a corpus of potentially 22 million units. The task is "find the needle." The reader then extracts the answer from the found passage — a relatively easy task. The retriever carries almost all the weight.
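To make the retrieval unit concrete, here is a minimal sketch of fixed-size passage chunking. It assumes whitespace-delimited words, and the helper name `split_into_passages` is hypothetical. Real pipelines usually count tokens with a tokenizer and may respect sentence boundaries.

```python
from typing import List


def split_into_passages(text: str, words_per_passage: int = 100) -> List[str]:
    """Split text into consecutive passages of at most `words_per_passage` words.

    Assumption: words are whitespace-delimited. This is an illustration,
    not the chunker used by any particular RAG system.
    """
    if words_per_passage <= 0:
        raise ValueError("words_per_passage must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


if __name__ == "__main__":
    # A synthetic 250-word "article" yields passages of 100, 100 and 50 words.
    article = " ".join(f"w{i}" for i in range(250))
    passages = split_into_passages(article)
    print(len(passages))  # 3
```

Every article contributes several such passages, which is how the index grows to millions of candidate units.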
This design was rational in the era when language models had 512–2048 token context windows. Longer retrieval units were unusable because the reader could not process them. The retriever had to do the precision work because the reader could not.
LongRAG (2024) reassesses this design choice given long-context LLMs that handle 128K tokens. Instead of 100-word units, it retrieves 4K-token units constructed by grouping related documents. The corpus shrinks from 22M to 600K units, so the retriever's job becomes "find the right section" rather than "find the exact needle." Recall@1 on NQ improves from 52% to 71%, and Recall@2 on HotpotQA improves from 47% to 72%.
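The grouping step can be sketched as a greedy packing under a token budget. This is an illustration under stated assumptions, not the LongRAG implementation. The `related` mapping stands in for whatever relatedness signal is available, `approx_tokens` is a crude whitespace count, and both function names are hypothetical.

```python
from typing import Dict, List, Set


def approx_tokens(text: str) -> int:
    """Crude token estimate: whitespace word count (assumption, not a real tokenizer)."""
    return len(text.split())


def group_into_long_units(
    docs: Dict[str, str],
    related: Dict[str, List[str]],
    max_tokens: int = 4000,
) -> List[List[str]]:
    """Greedily pack each document with its related neighbours into one unit.

    docs:    doc_id -> text
    related: doc_id -> ids of related documents (hypothetical relatedness signal)
    Returns a list of units, each a list of doc ids. Every document lands in
    exactly one unit; a document larger than the budget forms a unit by itself.
    """
    assigned: Set[str] = set()
    units: List[List[str]] = []
    for doc_id in docs:  # dicts preserve insertion order
        if doc_id in assigned:
            continue
        unit = [doc_id]
        assigned.add(doc_id)
        budget = max_tokens - approx_tokens(docs[doc_id])
        for neighbour in related.get(doc_id, []):
            if neighbour in assigned or neighbour not in docs:
                continue
            cost = approx_tokens(docs[neighbour])
            if cost <= budget:
                unit.append(neighbour)
                assigned.add(neighbour)
                budget -= cost
        units.append(unit)
    return units


if __name__ == "__main__":
    docs = {"a": "x x x", "b": "x x", "c": "x x x x", "d": "x"}
    related = {"a": ["b", "c"], "c": ["d"]}
    print(group_into_long_units(docs, related, max_tokens=6))
    # [['a', 'b'], ['c', 'd']]
```

With the figures above, going from 22M to 600K units leaves the retriever roughly 37 times fewer candidates to rank.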
The reader then receives the top-k long units concatenated (~30K tokens) and performs zero-shot answer extraction. The LLM handles what it is good at (understanding language in rich context), while the retriever handles what it is good at (coarse relevance ranking).
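The reader-side assembly can be sketched as follows. This is a minimal illustration with assumptions throughout. The function name `build_reader_prompt`, the prompt wording, and the default `k=4` are hypothetical rather than taken from the LongRAG paper. Token counts are approximated by whitespace words, and the actual LLM call is left out.

```python
from typing import List, Tuple


def build_reader_prompt(
    question: str,
    ranked_units: List[Tuple[str, str]],
    k: int = 4,
    max_context_tokens: int = 30000,
) -> str:
    """Concatenate the top-k long units and wrap them in a zero-shot QA prompt.

    ranked_units: (title, text) pairs, best first, as returned by some retriever.
    Token counts are approximated by whitespace words (assumption).
    Units that would overflow the budget are skipped, not truncated.
    """
    parts: List[str] = []
    used = 0
    for title, text in ranked_units[:k]:
        cost = len(text.split())
        if used + cost > max_context_tokens:
            continue
        parts.append(f"### {title}\n{text}")
        used += cost
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    units = [("Unit A", "alpha beta"), ("Unit B", "gamma"), ("Unit C", "delta")]
    print(build_reader_prompt("What is alpha?", units, k=2))
```

Skipping an oversized unit instead of truncating it keeps each retrieved unit intact, at the cost of sometimes passing fewer than k units to the reader.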
The broader principle: RAG architecture design assumptions were frozen at the constraints of their era. As those constraints lift (context windows, model capability, inference cost), the optimal design changes. "Best practices" based on 2020 constraints may be anti-patterns by 2025 standards.
Inquiring lines that use this note as a source (10)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do standard RAG systems struggle with pronouns and demonstratives?
- When does long-context LLM reasoning fail where structured retrieval succeeds?
- Can long-context readers handle compositional tasks or just semantic search?
- Could eliminating retrieval entirely work better than shifting the burden?
- Can context windows and RAG actually change what language models generate?
- What prompting strategies most effectively boost long-context LLM performance on retrieval?
- How do retrieved documents in RAG systems compound input length problems?
- How does separating local and global context dependencies affect long-context performance?
- Why do LLMs degrade on long inputs before hitting context limits?
- Does retrieval quality depend more on access structure or write gating?
Related concepts in this collection (6)
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  the reader doing more with longer context is analogous to more compute at inference enabling more capable responses
- Does limiting reasoning per turn improve multi-turn search quality?
  When language models engage in iterative search cycles, does capping reasoning at each turn, rather than just total compute, help preserve context for subsequent retrievals and improve overall search effectiveness?
  related trade-off in the opposite direction: too much context per turn degrades iterative search; LongRAG should be scoped to single-turn answering, not iterative research
- Can a single model replace retrieval for long-term conversation memory?
  COMEDY proposes collapsing the standard retrieval pipeline into one unified model that generates, compresses, and responds. But does eliminating the retriever actually improve performance, or does compression lose critical information?
  COMEDY takes the imbalance resolution further: rather than shifting burden to the reader, it eliminates the retriever entirely by merging retrieval and generation into a single compressive operation
- Does reasoning ability actually degrade with longer inputs?
  Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
  challenges the burden-shifting thesis: FLenQA shows reasoning accuracy drops from 0.92 to 0.68 at just 3000 tokens of irrelevant content; shifting burden to the reader assumes the reader can handle longer inputs, but reasoning degrades with length even far below context window limits
- Can long-context LLMs replace retrieval-augmented generation systems?
  Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.
  LOFT validates the burden-shift for semantic tasks (LCLMs rival RAG systems) while exposing its limits: compositional SQL-like tasks require structured query logic that attention-based reading cannot provide, bounding where the heavy-reader approach works
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  complementary burden-shift in a different direction: LongRAG shifts work from retriever to reader at query time; sleep-time compute shifts work from query time to pre-query time; both respond to the same insight that the retriever-does-everything assumption is an artifact of historical context constraints, not an architectural necessity
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
- Faith and Fate: Limits of Transformers on Compositionality
- LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
- Long-context LLMs Struggle with Long In-context Learning
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
- CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
- A Hybrid RAG System with Comprehensive Enhancement on Complex Reasoning
Original note title
heavy retriever / light reader imbalance is a historical artifact — long-context LLMs resolve it by shifting burden to the reader