INQUIRING LINE

What classifier accuracy is needed to assign memory roles reliably at retrieval time?

This reads the question as: if you build a system that tags retrieved memories by their *function* (clarifying, irrelevant, etc.) before using them, how good does that tagging classifier have to be before it helps rather than hurts — and the corpus reframes the question more than it answers it numerically.


This explores the accuracy bar for a classifier that sorts retrieved memories by their *role* at retrieval time. The honest finding is that the corpus argues hard for *why* role-tagging matters but never pins down a single accuracy threshold. The closest anchor is the observation that memory's functional role, not its raw relevance, drives conversational RAG quality: clarifying memory improves factual accuracy and constraint-awareness, while irrelevant memory *actively degrades* both (Does retrieved memory quality depend on its functional role?). That asymmetry is the real answer to your question. A wrong role assignment doesn't just fail to help; it injects noise that can pull the model below a no-memory baseline. The tolerable error rate is therefore set by how costly a false 'this is useful' tag is, not by some universal percentage.
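One way to make the cost asymmetry concrete is a break-even calculation. This is a sketch under assumptions that are not in the corpus: $g$ is a hypothetical gain from admitting a memory that really is clarifying, $c$ is a hypothetical cost of admitting one that is irrelevant, and $p$ is the classifier's calibrated probability that the memory is useful.

```latex
\begin{align*}
\mathbb{E}[\text{admit}] &= p\,g - (1 - p)\,c \\
\mathbb{E}[\text{admit}] \ge 0 &\iff p \ge \frac{c}{g + c}
\end{align*}
```

Under these assumptions, if an irrelevant memory costs three times what a clarifying one gains ($c = 3g$), a memory should be admitted only when $p \ge 0.75$. The bar moves with the cost ratio, which is why no single accuracy figure applies.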

The corpus does suggest the right way to *build* such a classifier, and it points away from heavy machinery. Adaptive-retrieval work shows that 27 lightweight external question features can match complex uncertainty-based methods on deciding *when* to retrieve, at a fraction of the cost (Can question features alone predict when to retrieve?). Meanwhile, calibrated token-probability uncertainty beats multi-call adaptive schemes on single-hop tasks (Can simple uncertainty estimates beat complex adaptive retrieval?). The lesson that transfers to role-assignment: a cheap, well-calibrated gate often outperforms an elaborate one, so the question isn't 'how accurate' in the abstract but 'how well-calibrated'. A classifier that knows when it's unsure can route low-confidence cases conservatively (treat-as-irrelevant) and avoid the costly false positives that do the damage.
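The conservative routing described above might look something like the following. This is a minimal sketch, not a method from the cited notes: the `TaggedMemory` type, the two role names, and the `min_confidence` default of 0.8 are all hypothetical, and the role probabilities are assumed to come from some already-calibrated classifier.

```python
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class TaggedMemory:
    """A retrieved memory plus role probabilities from a (hypothetical) classifier."""
    text: str
    role_probs: Mapping[str, float]  # e.g. {"clarifying": 0.9, "irrelevant": 0.1}


def assign_role(memory: TaggedMemory, min_confidence: float = 0.8) -> str:
    """Return the most probable role, falling back to 'irrelevant' when unsure.

    The fallback is deliberately one-sided: only a 'useful' tag has to clear
    the confidence bar, because a false 'useful' tag is the costly error.
    """
    role, prob = max(memory.role_probs.items(), key=lambda kv: kv[1])
    if role != "irrelevant" and prob < min_confidence:
        return "irrelevant"
    return role


def filter_memories(
    memories: List[TaggedMemory], min_confidence: float = 0.8
) -> List[TaggedMemory]:
    """Keep only memories whose assigned role is not 'irrelevant'."""
    return [m for m in memories if assign_role(m, min_confidence) != "irrelevant"]
```

The threshold only means what it says if the probabilities are calibrated; an overconfident classifier would clear the bar on exactly the cases the gate is meant to stop.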

There's a deeper structural hint in the verification literature. A two-stage pipeline (cheap pooled-cosine recall followed by a small learned verifier operating on full token-token similarity maps) reliably rejects 'structural near-misses' that MaxSim-style late interaction cannot (Can verification separate structural near-misses from topical matches?). This reframes your question: reliable role assignment may not be one accurate classifier at all, but a recall-then-verify cascade where the second stage's job is precisely to catch the confident-but-wrong cases. Accuracy targets become per-stage, and the binding constraint is the verifier's *precision on rejections*, not overall accuracy.
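The shape of such a cascade can be sketched in a few lines. This is an illustration only: the cited work uses a small Transformer verifier over token-token similarity maps, whereas here the verifier is a stand-in callable that returns a score, and `recall_then_verify`, `top_k`, and `accept_threshold` are hypothetical names and values.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_then_verify(
    query_vec: Vector,
    candidates: List[Tuple[str, Vector]],
    verifier: Callable[[str], float],  # stand-in for a learned verifier
    top_k: int = 3,
    accept_threshold: float = 0.5,
) -> List[str]:
    """Stage 1: cheap recall by pooled-vector cosine. Stage 2: verifier rejection."""
    # Stage 1 is tuned for recall: keep the top_k most similar candidates.
    recalled = sorted(
        candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True
    )[:top_k]
    # Stage 2 is tuned for rejection: drop anything the verifier scores too low.
    return [text for text, _ in recalled if verifier(text) >= accept_threshold]
```

Splitting the stages this way lets each be evaluated on its own terms: recall for the first, and how reliably the second rejects near-misses that looked similar in vector space.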

Framing the whole decision as a sequential policy rather than a one-shot label is the other path the corpus offers: DeepRAG models each reasoning step as a Markov Decision Process that learns when to lean on retrieved vs. internal knowledge, and reports a roughly 22% (21.99%) improvement, attributed to better-targeted retrieval and the elimination of noise from unnecessary external knowledge (When should language models retrieve external knowledge versus use internal knowledge?). Under that lens, 'classifier accuracy needed' dissolves into 'expected reward of the routing policy'. A mislabel is recoverable at later steps, so accuracy measured on the classifier in isolation can both understate and overstate the requirement.

So the thing you didn't know you wanted to know: there is no magic accuracy number because the corpus says the wrong target is being measured. What governs reliability is the *cost asymmetry of false positives*, the *calibration* of the gate, and whether role assignment is a single label or a recall-verify-route cascade. Build for cheap calibrated confidence and conservative fallback on uncertainty, and a moderately accurate classifier suffices; build for a single high-accuracy oracle and you'll still ship the failure mode where one confidently mislabeled irrelevant memory poisons the response.


Sources (5 notes)

Does retrieved memory quality depend on its functional role?

Retrieved memory type drives response quality more than relevance alone: clarifying memory improves factual accuracy and constraint awareness, while irrelevant memory actively degrades both. Role-aware retrieval and filtering are robustness requirements, not optional optimizations.

Can question features alone predict when to retrieve?

Learned predictors using 27 lightweight external question features match complex uncertainty-based methods on overall performance while costing far less, and outperform them on complex questions across 6 QA datasets.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Can verification separate structural near-misses from topical matches?

A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.
