Can you adapt retrieval models without accessing target data?

Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.

Synthesis note · 2026-02-22 · sourced from RAG

Dense retrieval models require labeled query-document pairs to adapt to new domains. In many enterprise contexts, the target collection is unavailable: it may not exist yet, it may be legally restricted (medical records, financial data), or sharing it with a model provider would compromise competitive advantage.

The standard assumption, that you need the target data to train for the domain, does not hold for retrieval in this setting. A brief textual description of the target domain can be enough to drive adaptation.

The pipeline:
(1) Provide a textual domain description.
(2) Use instruction-following LLMs to extract domain properties: document topics, linguistic attributes, source characteristics, terminology patterns.
(3) Generate seed documents matching those properties.
(4) Iteratively retrieve real-domain-like documents using the seed as query anchor.
(5) Generate synthetic queries for the constructed collection.
(6) Use pseudo-relevance labels to fine-tune the retrieval model.
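The data flow of steps (2) through (6) can be sketched as below. This is a minimal illustration under stated assumptions, not the method's actual implementation: the `LLM` callable, the prompt wording, `toy_llm`, `toy_search`, and `CORPUS` are all hypothetical stand-ins, the corpus searched in step (4) is assumed to be some openly accessible collection, and treating each query's generating document as its pseudo-positive is a simplification of however the pseudo-relevance labels are really produced.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

LLM = Callable[[str], str]                 # hypothetical: prompt in, text out
Search = Callable[[str, int], List[str]]   # hypothetical: (query, k) -> documents


@dataclass
class DomainProperties:
    topics: List[str]
    style: str


def extract_properties(description: str, llm: LLM) -> DomainProperties:
    # Step 2: ask the LLM for topics (one per line) and a style summary.
    raw = llm(f"List topics for: {description}")
    topics = [t.strip() for t in raw.splitlines() if t.strip()]
    style = llm(f"Describe the writing style of: {description}").strip()
    return DomainProperties(topics=topics, style=style)


def generate_seed_docs(props: DomainProperties, llm: LLM) -> List[str]:
    # Step 3: one synthetic seed document per extracted topic.
    return [llm(f"Write a {props.style} passage about {t}") for t in props.topics]


def expand_collection(seeds: List[str], search: Search,
                      rounds: int = 2, k: int = 2) -> List[str]:
    # Step 4: use current documents as queries; newly found documents
    # become the queries of the next round.
    collection, frontier = list(seeds), list(seeds)
    for _ in range(rounds):
        found: List[str] = []
        for doc in frontier:
            for hit in search(doc, k):
                if hit not in collection and hit not in found:
                    found.append(hit)
        if not found:
            break
        collection.extend(found)
        frontier = found
    return collection


def build_training_pairs(collection: List[str], llm: LLM) -> List[Tuple[str, str]]:
    # Steps 5-6: one synthetic query per document; the generating document
    # is treated as the pseudo-positive (an assumption of this sketch).
    return [(llm(f"Write a search query answered by: {d}"), d) for d in collection]


# Toy stand-ins so the sketch runs without a model or an index.
CORPUS = [
    "warfarin and aspirin drug interactions raise bleeding risk",
    "pediatric dosage guidance for ibuprofen",
    "quarterly earnings report from retail chains",
]


def toy_llm(prompt: str) -> str:
    if prompt.startswith("List topics"):
        return "drug interactions\ndosage guidance"
    if prompt.startswith("Describe the writing style"):
        return "clinical"
    if prompt.startswith("Write a search query"):
        return "query about " + prompt.split(": ", 1)[1][:20]
    return "passage: " + prompt


def toy_search(query: str, k: int) -> List[str]:
    # Rank corpus documents by the number of shared lowercase tokens.
    q = set(query.lower().split())
    scored = [(len(q & set(d.split())), d) for d in CORPUS]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
    return [d for _, d in ranked[:k]]
```

The resulting `(query, document)` pairs would then feed whatever fine-tuning objective the retrieval model uses. Passing the LLM and the search function in as callables keeps the sketch independent of any particular model provider or index.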

The retrieval-augmented approach to domain understanding is key: at step (2), the domain description itself becomes a RAG query to extract structured properties, which then parameterize generation at step (3). The pipeline thus bootstraps from description, through synthesis, to training.
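One way to picture the description acting as a retrieval query whose results are turned into structured properties is the sketch below. It is illustrative only: the JSON schema, the prompt wording, and the function names (`build_extraction_prompt`, `parse_properties`, `seed_prompt`, `extract_domain_properties`) are assumptions made for this example, and the `retrieve` and `llm` callables are hypothetical placeholders for a real retriever and model.

```python
import json
from typing import Callable, Dict, List

# Assumed property schema, mirroring the four property types named in step (2).
PROPERTY_KEYS = ("topics", "linguistic_attributes",
                 "source_characteristics", "terminology")


def build_extraction_prompt(description: str, passages: List[str]) -> str:
    # The description is both the retrieval query and part of the prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Domain description:\n" + description + "\n\n"
        "Retrieved context:\n" + context + "\n\n"
        "Return a JSON object with keys: " + ", ".join(PROPERTY_KEYS) + ". "
        "Each value is a list of short strings."
    )


def parse_properties(raw: str) -> Dict[str, List[str]]:
    # Validate that the model returned every expected property.
    data = json.loads(raw)
    missing = [k for k in PROPERTY_KEYS if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return {k: [str(v) for v in data[k]] for k in PROPERTY_KEYS}


def extract_domain_properties(description: str,
                              retrieve: Callable[[str, int], List[str]],
                              llm: Callable[[str], str],
                              k: int = 3) -> Dict[str, List[str]]:
    passages = retrieve(description, k)  # description used as the query
    return parse_properties(llm(build_extraction_prompt(description, passages)))


def seed_prompt(props: Dict[str, List[str]], topic: str) -> str:
    # Step (3): the extracted properties parameterize seed generation.
    return (
        f"Write a passage about {topic}. "
        f"Style: {', '.join(props['linguistic_attributes'])}. "
        f"Source type: {', '.join(props['source_characteristics'])}. "
        f"Use terms such as: {', '.join(props['terminology'])}."
    )
```

Requesting a fixed set of keys and validating them before use means a malformed model response fails at extraction time rather than silently producing off-domain seed documents.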

Evaluation on five diverse target domains shows that description-based adaptation outperforms existing dense retrieval baselines in the zero-target-access scenario. The approach enables adaptation in precisely the contexts where conventional adaptation is blocked: privacy-sensitive domains, legally restricted data, competitive scenarios.

Inquiring lines that read this note 38

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do semantic similarity and task relevance diverge in vector embeddings?

How can LLM recommenders match or exceed collaborative filtering performance?

How should retrieval systems optimize for multi-step reasoning during inference?

How does example difficulty affect learning efficiency in language models?

Why does capturing domain structure reduce data requirements more than raw volume?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

What access constraints allow description-based adaptation but block conventional techniques?

When should retrieval-augmented systems decide to fetch new information?

How do knowledge injection methods compare across cost and effectiveness?

What makes specific clarifying questions more effective than generic ones?

What documents improve answers beyond surface query similarity?

How can identical external performance mask different internal representations?

Why does pure numeric ID indexing force models to learn from scratch?

Does domain specialization cause models to lose capabilities elsewhere?

How does retrieval-augmented training reduce domain specialization cliff failures?

Why do persona-level simulations fail to predict individual preferences accurately?

Can Parfit's identity criteria apply to something that gets reconstituted from text data?

How should iterative research systems allocate reasoning per search step?

How does reflection-based query refinement differ from single-pass retrieval strategies?

How does memorization interact with learning and generalization?

How much training data teaches retrieval models to follow instructions?

Related concepts in this collection 2

Does model access level determine which specialization techniques work?
Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.
Description-based adaptation enables white-box-style performance under a grey/black-box access constraint: you describe the domain without sharing the data.

Can organizing knowledge structures beat raw training data volume?
Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.
Both show that structured domain knowledge (taxonomy or description) dramatically reduces data requirements; the key is capturing domain structure, not data volume.
