INQUIRING LINE

How does selective history retrieval improve conversational search accuracy?

This explores why picking out the *relevant* parts of a conversation's history beats stuffing in everything — and what the corpus says about which parts are worth keeping.


The most direct answer in the collection is also the most counterintuitive: more context is not better. Automatically selecting which previous turns matter improves retrieval more than including all of them, and it even beats hand-picking by human annotators when selection and retrieval are trained together (see "Does including all conversation history actually help retrieval?"). The reason is that conversations don't stay on one topic. When a user switches subjects, the old turns become noise that pulls the retriever toward irrelevant matches. Selection works because it strips that noise before it can poison the query.
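To make the noise-stripping idea concrete, here is a minimal sketch of turn selection before retrieval. It is an illustration under loud assumptions: the work summarized above learns selection jointly with retrieval, whereas this stand-in scores turns by plain token overlap (Jaccard). The function names, the threshold of 0.2, and the example turns are all hypothetical.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for a learned relevance signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_turns(history, query, threshold=0.2):
    """Keep past turns whose Jaccard overlap with the query meets the threshold."""
    q = tokens(query)
    kept = []
    for turn in history:
        t = tokens(turn)
        union = q | t
        score = len(q & t) / len(union) if union else 0.0
        if score >= threshold:
            kept.append(turn)
    return kept

def build_query(history, query, threshold=0.2):
    """Concatenate only the selected turns with the current query."""
    return " ".join(select_turns(history, query, threshold) + [query])

# Hypothetical conversation with a topic switch in the third turn.
history = [
    "What is the battery range of the Model 3?",
    "How long does the Model 3 battery take to charge?",
    "Any good pasta recipes for dinner?",
]
query = "Is the Model 3 battery covered by warranty?"
selected = select_turns(history, query)
```

With these inputs the two battery turns are kept and the pasta turn is dropped, so the off-topic text never reaches the retriever. A full-context baseline would simply join all of `history` with `query`.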

But "select the relevant turns" hides a harder problem: relevant *how*? The corpus suggests plain semantic similarity isn't enough. Conversational memory faces two challenges a static search index never has: time-anchored questions like "what did we discuss Tuesday?" that need metadata rather than meaning-matching, and dangling references like "tell me more about that" that have to be resolved to a concrete subject *before* you can retrieve anything (see "Why do time-based queries fail in conversational retrieval systems?"). So part of how selective retrieval improves accuracy is by recognizing that some queries aren't semantic queries at all, and routing them differently.
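The routing idea can be sketched as a small dispatcher. Everything here is an assumption made for illustration: the regex patterns, the three route names, the (timestamp, text) history layout, and the naive "use the most recent turn" resolution are not taken from the cited note. Only weekday names are treated as temporal anchors in this sketch.

```python
import re
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
TEMPORAL = re.compile(r"\b(" + "|".join(WEEKDAYS) + r")\b", re.I)
DANGLING = re.compile(r"\b(that|this|it|those|them)\b", re.I)

def route(query):
    """Classify a query as temporal, reference, or semantic (hypothetical labels)."""
    if TEMPORAL.search(query):
        return "temporal"
    if DANGLING.search(query):
        return "reference"
    return "semantic"

def turns_on_weekday(history, query):
    """Temporal path: filter on timestamp metadata instead of meaning."""
    wanted = [d for d in WEEKDAYS if re.search(rf"\b{d}\b", query, re.I)]
    return [text for ts, text in history if WEEKDAYS[ts.weekday()] in wanted]

def resolve_reference(history, query):
    """Reference path: anchor the query to the most recent turn before retrieval."""
    last_text = history[-1][1] if history else ""
    return f"{query} (context: {last_text})"

# Hypothetical history of (timestamp, text) pairs.
history = [
    (datetime(2024, 1, 2, 10, 0), "We compared vector databases."),      # a Tuesday
    (datetime(2024, 1, 3, 9, 30), "We talked about reranking models."),  # a Wednesday
]
```

Here `route("What did we discuss Tuesday?")` takes the temporal path and `turns_on_weekday` returns only the Tuesday turn, while `route("Tell me more about that")` takes the reference path and the query is rewritten before any similarity search runs.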

There's a deeper twist worth knowing: the best thing to retrieve may not be raw history at all. One line of work finds that abstracted preference *summaries* (a compressed portrait of what the user tends to want) consistently beat pulling up specific past interactions, and that recency-based recall outperforms similarity-based recall (see "Does abstract preference knowledge outperform specific interaction recall?"). That reframes "selective history retrieval" as a spectrum: select the right turns, or distill the turns into knowledge and retrieve that instead. Pushed to the extreme, some systems try to fold memory generation, compression, and response into a single model and skip the retrieval step entirely, though that path is fragile, degrading below even a no-memory baseline when it overprocesses and misgroups what it stored (see "Can a single model replace retrieval for long-term conversation memory?").
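The difference between the two recall strategies mentioned above can be shown in a few lines. This is a toy contrast, not the evaluated method: the function names are hypothetical, similarity is approximated by token overlap, and the memory list is assumed to be ordered oldest to newest.

```python
import re

def tokens(text):
    """Lowercased word set used as a rough similarity signal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def recall_by_recency(memories, k):
    """Return the k most recent memories, newest first."""
    if k <= 0:
        return []
    return list(reversed(memories[-k:]))

def recall_by_similarity(memories, query, k):
    """Return the k memories sharing the most tokens with the query."""
    q = tokens(query)
    ranked = sorted(memories, key=lambda m: len(q & tokens(m)), reverse=True)
    return ranked[:k]

# Hypothetical memory store, oldest first.
memories = [
    "Asked for vegan pizza places",
    "Booked a flight to Oslo",
    "Asked for a vegan ramen recipe",
]
query = "vegan pizza near me"
recent = recall_by_recency(memories, 2)
similar = recall_by_similarity(memories, query, 2)
```

The two strategies return different pairs for the same query: recency surfaces the ramen request and the Oslo booking, while similarity surfaces the two vegan requests. Which one helps more is an empirical question that the cited note answers in favor of recency; the sketch only shows what each strategy retrieves.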

The recommendation side of the collection adds a useful contrast. There, the lesson runs the opposite direction: conversational recommenders often use *too little* history, leaning only on the active session and losing valuable signal from past dialogues and similar users (see "Can conversational recommenders recover lost preference signals from history?"). Put next to the search findings, the real principle emerges: accuracy comes not from more history or less history, but from selecting the right *channels* of it and conditioning that selection on the user's current intent. Selection is the act of matching what you pull to what the user means right now.

If you follow one thread further, start with the temporal-and-reference problem (see "Why do time-based queries fail in conversational retrieval systems?"). It's the cleanest illustration of why semantic search alone quietly fails in conversation, and why "selective" has to mean smart, not just smaller.


Sources (5 notes)

Does including all conversation history actually help retrieval?

Research shows that automatically selecting relevant previous turns improves retrieval effectiveness more than including all context. Topic switches inject irrelevant information; joint optimization of selection and retrieval beats both full-context baselines and human annotation.

Why do time-based queries fail in conversational retrieval systems?

Conversational memory faces two distinct retrieval challenges absent from static databases: time-based queries ("what did we discuss Tuesday?") requiring metadata indexing, and ambiguous references ("tell me more about that") requiring contextual disambiguation before retrieval.

Does abstract preference knowledge outperform specific interaction recall?

PRIME framework shows semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning exceeds preference tuning methods.

Can a single model replace retrieval for long-term conversation memory?

COMEDY merges memory generation, compression, and response into one operation, tracking event recaps, user portraits, and relationship dynamics without vector-DB retrieval. However, empirical work shows continuous reprocessing follows an inverted-U curve, degrading below no-memory baseline due to misgrouping, context loss, and overfitting.

Can conversational recommenders recover lost preference signals from history?

Current CRS systems only use the active dialogue session to infer preferences, losing item-CF and user-CF signals proven valuable in traditional recommenders. Integrating current session, historical dialogues, and look-alike users—conditioned on current intent—recovers essential user representation structure.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a conversational search researcher evaluating whether selective history retrieval still improves accuracy in 2025+. The question: Does filtering prior turns outperform full-context inclusion, and if so, why?

What a curated library found — and when (dated claims, not current truth):
— Automatic turn selection beats both full-context and human annotation when selection and retrieval are trained jointly (2023–24).
— Conversational queries split into semantic (similarity-based) and non-semantic (temporal anchors, dangling references) types; semantic selection alone fails on the latter (~2023–24, arXiv:2307.16689).
— Abstracted preference summaries and recency-based recall outperform episodic turn retrieval; compressive memory degrades below no-memory baselines when it misgroups stored content (2024–25, arXiv:2402.11975).
— Recommender systems underuse historical signals; the principle is channel-matching, not volume (2021–24).

Anchor papers (verify; mind their dates):
— arXiv:2306.02553 (2023, conversational search turn selection)
— arXiv:2307.16689 (2023, temporal & reference challenges)
— arXiv:2402.11975 (2024, compressive memory trade-offs)
— arXiv:2507.04607 (2025, long-term memory personalization)

Your task:
(1) RE-TEST EACH CONSTRAINT. For semantic/non-semantic split: have newer retrievers or routing methods (LLM-as-judge, intent classifiers) since relaxed the distinction or made it harder to exploit? For compressive memory: do 2025 models with larger context windows and better compression (e.g., via structured summarization) still degrade, or has scaling solved it? For abstraction-vs-episodic: do adaptive memory systems now blend both dynamically? Flag which constraints appear durable.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work in the last 6 months — especially anything that shows full context, multi-agent coordination (arXiv:2508.18167), or proactive critique (arXiv:2507.23407) re-inverting the "less is more" finding.
(3) Propose 2 research questions assuming the regime has shifted: (a) Does multi-turn agentic orchestration (where agents delegate which turns to pass to each other) outperform single-model selection? (b) Can in-context learning over a few diverse dialogues teach a model when to ignore history better than training-time selection?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
