Can document count be learned instead of fixed in RAG?
Standard RAG systems use a fixed number of documents regardless of query complexity. Can an RL agent learn to dynamically select both how many documents and their order based on what helps the generator produce correct answers?
Standard RAG re-ranking systems pass a fixed number of documents, k, to the generator. The value of k is set by the system designer and held constant across queries. This is wrong in both directions: with too few documents, complex queries lose critical information; with too many, noise misleads the generator and reduces efficiency.
Re-ranking approaches before DynamicRAG leave the choice of k unsolved. Re-rankers have improved document ordering but assumed k was given: the number of documents passed to the generator is treated as a hyperparameter, not a learned decision.
DynamicRAG models the reranker as an RL agent whose action selects both an ordering of the retrieved documents and how many of them to keep. The reward is LLM output quality: specifically, whether the generator produces a correct answer given the selected document set. The agent receives both explicit query signals and the generator's feedback.
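A minimal sketch of that action and reward, under stated assumptions: the names `RerankAction`, `select_documents`, `exact_match_reward`, and `episode_reward` are hypothetical, the generator is any callable taking a query and a document list, and exact match stands in for "correct answer" (the note does not say how correctness is scored).

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class RerankAction:
    """One reranker action: indices of the kept documents, best first.

    The length of `order` is the learned count k; its sequence is the
    learned ordering. (Hypothetical structure, for illustration only.)
    """
    order: Tuple[int, ...]

    @property
    def k(self) -> int:
        return len(self.order)


def select_documents(retrieved: Sequence[str], action: RerankAction) -> List[str]:
    """Apply an action to the retrieved list, rejecting repeated documents."""
    if len(set(action.order)) != len(action.order):
        raise ValueError("an action may not repeat a document")
    return [retrieved[i] for i in action.order]


def exact_match_reward(answer: str, gold: str) -> float:
    """Assumed correctness check: case-insensitive exact match."""
    return 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0


def episode_reward(
    query: str,
    retrieved: Sequence[str],
    action: RerankAction,
    generate: Callable[[str, List[str]], str],
    gold: str,
) -> float:
    """Reward for one action: did the generator answer correctly
    when given only the selected documents, in the selected order?"""
    docs = select_documents(retrieved, action)
    return exact_match_reward(generate(query, docs), gold)
```

Here an action such as `RerankAction(order=(1, 0))` keeps two documents and puts the second retrieved document first; an action of length one keeps a single document. The reward depends only on the generator's answer, so the same function scores any combination of count and order.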
Training proceeds in two phases. First, behavior cloning on expert trajectories (SFT) gives the reranker a baseline policy and reduces action space complexity. Second, RL with generator feedback allows the reranker to explore and learn to calibrate both ordering and count to query needs.
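The two phases can be illustrated with a toy policy. This is a sketch under strong simplifying assumptions, not the DynamicRAG training code: the policy is a plain softmax over count choices only (no ordering, no query conditioning), the expert actions and the reward function are made up, and phase 2 uses vanilla REINFORCE with a running-mean baseline as a stand-in for whatever RL algorithm the authors use.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def behavior_cloning(logits, expert_actions, lr=0.5, epochs=50):
    """Phase 1 (SFT): gradient ascent on the log-likelihood of expert actions."""
    logits = list(logits)
    for _ in range(epochs):
        for a in expert_actions:
            probs = softmax(logits)
            # d/d logit_j of log p(a) is 1[j == a] - p_j
            for j in range(len(logits)):
                logits[j] += lr * ((1.0 if j == a else 0.0) - probs[j])
    return logits


def reinforce(logits, reward_fn, rng, lr=0.2, steps=2000):
    """Phase 2 (RL): sample an action, score it, follow the policy gradient."""
    logits = list(logits)
    baseline = 0.0  # running mean of rewards, reduces gradient variance
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(logits)), weights=probs)[0]
        r = reward_fn(a)
        advantage = r - baseline
        baseline += 0.05 * (r - baseline)
        for j in range(len(logits)):
            logits[j] += lr * advantage * ((1.0 if j == a else 0.0) - probs[j])
    return logits


# Toy setup (assumed): action index a means "keep a + 1 documents".
# The expert is only roughly right (keeps 2 or 3); the generator is
# rewarded only when exactly 3 documents are kept.
sft_logits = behavior_cloning([0.0] * 5, expert_actions=[1, 2, 1, 2])
rl_logits = reinforce(
    sft_logits,
    reward_fn=lambda a: 1.0 if a == 2 else 0.0,
    rng=random.Random(0),
)
```

In this toy, behavior cloning concentrates probability on the counts the expert used, so the RL phase starts from a narrow, sensible region instead of a uniform policy. The RL phase then shifts probability toward the count that the generator-based reward actually favours.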
The insight generalizes beyond re-ranking: any RAG system parameter that is currently a heuristic (chunk size, retrieval depth, context window allocation) is a candidate for learning via generator feedback. The generator's output quality is a reward signal that can be used to train any component of the pipeline that affects what the generator receives, even when that component is not differentiable.
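As one hedged illustration of this generalization, a heuristic parameter such as chunk size can be treated as the arms of a bandit scored by generator feedback. The sketch below uses epsilon-greedy selection, which is a swapped-in technique and not something the note or DynamicRAG prescribes; `tune_parameter`, the candidate values, and `reward_fn` are all hypothetical, and `reward_fn` stands for "run the pipeline with this setting and score the generator's answer".

```python
import random


def tune_parameter(candidates, reward_fn, rng, steps=200, epsilon=0.1):
    """Epsilon-greedy search over a list of candidate parameter values.

    Returns the value with the highest mean observed reward, plus the
    per-value mean rewards.
    """
    counts = {c: 0 for c in candidates}
    means = {c: 0.0 for c in candidates}

    def update(c):
        r = reward_fn(c)
        counts[c] += 1
        means[c] += (r - means[c]) / counts[c]  # incremental mean

    for c in candidates:  # try every value once before exploiting
        update(c)
    for _ in range(steps):
        if rng.random() < epsilon:
            c = rng.choice(candidates)  # explore
        else:
            c = max(candidates, key=means.get)  # exploit current best
        update(c)
    return max(candidates, key=means.get), means
```

A call might look like `tune_parameter([128, 256, 512, 1024], reward_fn, random.Random(0))`. Unlike the RL reranker, this picks one global setting, not a per-query decision; making the choice depend on the query would require a contextual policy.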
Inquiring lines that use this note as a source (7)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- What techniques enable RAG systems to handle heterogeneous data formats at scale?
- How do retrieved documents in RAG systems compound input length problems?
- Can other RAG hyperparameters like chunk size be learned through generator feedback?
- Why do RAG systems fail when demo queries work correctly?
- What threshold combinations for uncertainty and rarity signals maximize RAG performance?
- What five requirements do enterprise RAG systems need beyond accuracy?
- What concrete failures happen when RAG ignores temporal relevance?
Related concepts in this collection (5)
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  same adaptive allocation principle applied to document selection; optimal k depends on query, not system configuration
- Can retrieval learn what actually helps answer questions?
  Standard RAG trains retrievers to find similar documents and generators to produce answers separately. But does surface similarity match what genuinely helps generate correct responses? This explores whether retrieval can receive feedback from answer quality.
  CLaRa addresses the same generator-feedback problem via continuous representations; DynamicRAG addresses it via RL
- Does RL improve domain reasoning by adding knowledge or removing it?
  When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.
  same RL mechanism at a different level: RL prunes wrong reasoning paths in domain contexts, DynamicRAG prunes wrong document selections; in both cases RL refines an existing process by suppressing suboptimal choices rather than adding new capability
- Does supervising retrieval steps outperform final answer rewards?
  Can intermediate feedback on retrieval decisions (which documents to fetch, when to stop) train agentic RAG systems more effectively than rewarding only the final answer? This matters because poor retrieval paths can accidentally succeed or good ones can fail on noisy metrics.
  complementary RL approaches to RAG: DynamicRAG learns document count/order via RL, RAG-Gym learns intermediate retrieval step quality via process supervision; together they show RL can optimize both the what-to-include and how-to-retrieve aspects of RAG
- Can rationale-driven selection beat similarity re-ranking for evidence?
  Can LLMs generate search guidance that outperforms traditional similarity-based evidence ranking? This matters because current re-ranking lacks interpretability and fails against adversarial attacks.
  both solve fixed-k but via different mechanisms: DynamicRAG learns adaptive k through RL with generator feedback, METEORA eliminates k via rationale-match elbow detection; DynamicRAG is training-time optimization, METEORA is inference-time architecture
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
- You Don't Need Pre-built Graphs for RAG: Retrieval Augmented Generation with Adaptive Reasoning Structures
- RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
- RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
- Useful Memories Become Faulty When Continuously Updated by LLMs
- Chain-of-Retrieval Augmented Generation
- RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
- UR2: Unify RAG and Reasoning through Reinforcement Learning
Original note title
rl-trained reranker that adjusts document order and count solves the fixed top-k problem in rag