INQUIRING LINE

What are the 27 external features that predict retrieval need?

This explores a specific finding — that 27 cheap, surface-level features of a question alone can predict whether a system needs to go fetch outside information — and places it inside the larger corpus debate over how systems should decide when to retrieve at all.


The result concerns a small set of "external" question features: properties you can read off the question itself, before the model even tries to answer, that predict whether retrieval is worth doing. The corpus doesn't hand you a labeled list of all 27, but it tells you what kind of features they are and why they matter. They're called *external* and *lightweight* precisely because they don't require running the model and measuring its confidence. Think of question length, entity counts, question type (who/what/when), presence of rare or named terms, and syntactic complexity: cheap signals computed from the text. The striking claim in "Can question features alone predict when to retrieve?" is that a learned predictor over these 27 features *matches* far more expensive uncertainty-estimation methods on overall performance across six QA datasets, and actually *beats* them on complex questions, at a fraction of the cost.
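As a rough illustration of what "read off the question itself" can mean, the sketch below computes a few surface features and passes them to a logistic gate. Everything in it is an assumption made for illustration: the five features, the names `extract_features` and `retrieval_probability`, and the idea of a plain logistic scorer are hypothetical, and they are not the 27 features or the predictor from the cited work, which the corpus does not enumerate.

```python
import math
import re
from dataclasses import dataclass

QUESTION_WORDS = ("who", "what", "when", "where", "why", "how", "which")


@dataclass
class QuestionFeatures:
    n_tokens: int          # question length in tokens
    question_word: str     # first wh-word found, or "other"
    n_capitalized: int     # crude proxy for named entities
    has_digit: bool        # dates, quantities, identifiers
    mean_token_len: float  # crude proxy for lexical rarity


def extract_features(question: str) -> QuestionFeatures:
    """Compute surface features from the question text only (no model call)."""
    tokens = re.findall(r"[A-Za-z0-9']+", question)
    lowered = [t.lower() for t in tokens]
    question_word = next((t for t in lowered if t in QUESTION_WORDS), "other")
    # Skip the first token: it is capitalized by sentence convention.
    n_capitalized = sum(1 for t in tokens[1:] if t[0].isupper())
    has_digit = any(ch.isdigit() for ch in question)
    mean_len = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    return QuestionFeatures(len(tokens), question_word, n_capitalized,
                            has_digit, mean_len)


def retrieval_probability(f: QuestionFeatures, weights: dict,
                          bias: float = 0.0) -> float:
    """Logistic score over the numeric features.

    `weights` would be learned from labeled data; none are supplied here.
    """
    numeric = {
        "n_tokens": float(f.n_tokens),
        "n_capitalized": float(f.n_capitalized),
        "has_digit": float(f.has_digit),
        "mean_token_len": f.mean_token_len,
    }
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in numeric.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    print(extract_features("Who directed the 1975 film Jaws?"))
```

The point of the design is that nothing above calls a language model: the gate's cost is a regular expression and a dot product, which is what makes this family of predictors cheap compared with generating an answer and measuring its uncertainty.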

Why does this matter beyond saving compute? Because it reframes the problem of deciding when to retrieve. The dominant alternative is to let the model introspect: generate an answer, measure how uncertain it is, and retrieve only when it's shaky. "When should language models retrieve external knowledge versus use internal knowledge?" takes the introspective route to its logical end, treating each reasoning step as a decision about whether to lean on internal knowledge or reach for external knowledge, and reports a ~22% improvement that comes from better-targeted retrieval and from *not* retrieving when retrieval would just add noise. The external-features result says: you can get much of that selectivity without the expensive self-interrogation, because the question's surface already leaks whether it's the kind of thing the model knows.
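The introspective loop (generate, measure uncertainty, retrieve only when shaky) can be sketched as follows, assuming the model exposes per-token log-probabilities for its draft answer. Mean negative log-probability is used here as one simple uncertainty proxy. That proxy, the `threshold` default, and the function names are assumptions for illustration, not the method of the cited note, which learns the decision at each reasoning step.

```python
import math
from typing import Sequence


def mean_negative_log_prob(token_logprobs: Sequence[float]) -> float:
    """Average negative log-probability of the draft answer's tokens.

    Higher values mean the model assigned lower probability to its own
    answer, which is read here as higher uncertainty.
    """
    if not token_logprobs:
        raise ValueError("draft answer has no tokens")
    return -sum(token_logprobs) / len(token_logprobs)


def should_retrieve(token_logprobs: Sequence[float],
                    threshold: float = 1.0) -> bool:
    """Retrieve only when the draft answer looks uncertain.

    `threshold` is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return mean_negative_log_prob(token_logprobs) > threshold


if __name__ == "__main__":
    confident_draft = [math.log(0.9)] * 4  # every token at p = 0.9
    shaky_draft = [math.log(0.2)] * 4      # every token at p = 0.2
    print(should_retrieve(confident_draft), should_retrieve(shaky_draft))
```

Note where the cost sits: the log-probabilities only exist after the model has produced a draft answer, so every question pays for at least one generation before the gate can decide anything.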

The deeper reason this is interesting is that "when to retrieve" turns out to be one of the load-bearing failure points of RAG, not a tuning detail. "Where do retrieval systems fail and why?" names adaptive triggering as one of three *structural* failure points: fixed-interval retrieval wastes context by fetching when nothing is needed and starves the model when something is. "How should systems retrieve and reason with external knowledge?" echoes that retrieval should adapt dynamically and couple tightly with reasoning rather than fire on a schedule. So the 27 features aren't a niche trick; they're one cheap answer to a problem the corpus treats as architectural.
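The two costs of a fixed schedule can be made concrete with a toy count. The sketch below assumes we already know, per generation step, whether retrieval was actually needed (a hypothetical oracle list used only for illustration), and tallies steps where a fixed-interval trigger fetched for nothing versus steps where it failed to fetch when needed. The function names are illustrative and not drawn from the cited notes.

```python
from typing import List, Sequence, Tuple


def fixed_interval_schedule(n_steps: int, interval: int) -> List[bool]:
    """Fire retrieval on step 0 and every `interval` steps after it."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return [step % interval == 0 for step in range(n_steps)]


def count_mismatches(needs: Sequence[bool],
                     fires: Sequence[bool]) -> Tuple[int, int]:
    """Return (wasted, starved).

    wasted:  retrieval fired although nothing was needed.
    starved: retrieval was needed but did not fire.
    """
    wasted = sum(1 for need, fire in zip(needs, fires) if fire and not need)
    starved = sum(1 for need, fire in zip(needs, fires) if need and not fire)
    return wasted, starved


if __name__ == "__main__":
    # Hypothetical oracle: retrieval is needed only at steps 1 and 4.
    needs = [False, True, False, False, True, False]
    fixed = fixed_interval_schedule(len(needs), interval=3)
    print("fixed interval:", count_mismatches(needs, fixed))
    print("perfect adaptive trigger:", count_mismatches(needs, needs))
```

A perfect adaptive trigger scores zero on both counts by construction; the practical question the surrounding notes address is how closely a cheap predictor can approximate it.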

There's a lateral tension worth seeing. The external-features approach keeps the retrieval decision with a separate gatekeeper that decides from the outside. Other lines in the corpus argue the model itself should make the call: "Can models decide better than retrievers which tools to use?" shows that models emitting their own structured tool requests outperform single-round semantic matching, and "Can retrieval learn what actually helps answer questions?" pushes the decision inward by training the retriever on whether retrieved documents actually improved the answer. Read together, the question "what features predict retrieval need?" sits between two philosophies: predict it cheaply from the outside, or let the system learn it from the inside. The surprising takeaway is that the cheap outside view holds its own, and is hardest to beat exactly where it counts: on the complex questions.


Sources (6 notes)

Can question features alone predict when to retrieve?

Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

How should systems retrieve and reason with external knowledge?

Research shows retrieval should adapt dynamically rather than follow fixed patterns, reasoning and retrieval must integrate closely, and embedding-based retrieval has fundamental limits requiring architectural alternatives.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

Can retrieval learn what actually helps answer questions?

CLaRa propagates generator loss back through continuous document representations, allowing retrievers to optimize for documents that actually improve answers rather than surface similarity. The gap between relevance and usefulness closes when retrieval receives direct feedback from generation success.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a RAG systems researcher evaluating whether lightweight question features can reliably predict retrieval need. The question remains open: which surface-level signals (question length, entity counts, type, rarity, syntax) actually generalize as retrieval triggers, and how do they compare to learned or model-introspective alternatives?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat as perishable constraints to re-test.
• A learned predictor over ~27 external question features (e.g., length, entity density, question type, named-entity presence, syntactic complexity) matches expensive uncertainty-estimation methods across six QA datasets and *beats* them on complex questions at a fraction of the cost (~2025, arXiv:2505.04253).
• Model-driven introspective approaches (generate, measure uncertainty, retrieve conditionally) report ~22% accuracy gains by *not* retrieving when retrieval adds noise, treating each reasoning step as a separate retrieve/internal-knowledge decision (~2024–2025 path).
• Fixed-interval retrieval wastes context when nothing is needed and starves the model when something is; the lack of adaptive triggering is identified as a *structural* failure mode of RAG, not a tuning detail (~2024, arXiv:2404.16130).
• Proactive model-emitted tool requests and jointly trained retriever-generator systems shift the decision inward; tension exists between external gating (question-feature prediction) and internal learning (model decides when to retrieve).

Anchor papers (verify; mind their dates):
• arXiv:2505.04253 (2025-05): LLM-Independent Adaptive RAG — external features as predictor.
• arXiv:2507.09477 (2025-07): Towards Agentic RAG with Deep Reasoning — survey covering reasoning + retrieval coupling.
• arXiv:2604.16351 (2026-03): Training for Compositional Sensitivity — generalization of dense retrieval.
• arXiv:2404.16130 (2024-04): Graph RAG approach to summarization — structural adaptive retrieval.

Your task:
(1) RE-TEST EACH CONSTRAINT. Has the cost-performance trade-off between external-feature prediction and introspective uncertainty estimation shifted? Do newer foundation models (2025–2026) or multi-agent/orchestration frameworks (memory, caching, tool-calling SDKs) *relax* the need for lightweight features by making introspection cheaper or more reliable? Separate the durable insight (that the question's surface leaks retrieval need) from the perishable claim that 27 specific features are optimal.
(2) Surface the strongest *contradicting* work from the last 6 months: do newer agentic or reasoning-integrated RAG systems *learn* retrieval triggers so well that hand-engineered external features become obsolete?
(3) Propose 2 research questions assuming the retrieval-decision regime may have shifted: (a) Can zero-shot or in-context-learned feature importance rankings replace the fixed feature set? (b) Do multi-step reasoning chains and tool-use logs now implicitly encode better retrieval signals than the question alone?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
