Can you adapt retrieval models without accessing target data?
Explores whether dense retrieval systems can adapt to new domains using only a textual description, rather than actual target documents—especially relevant for privacy-restricted or competitive scenarios.
Dense retrieval models require labeled query-document pairs to adapt to new domains. In many enterprise contexts, the target collection is unavailable: it may not exist yet, it may be legally restricted (medical records, financial data), or sharing it with a model provider would compromise competitive advantage.
The standard assumption, that you need the target data to train for the domain, does not strictly hold for retrieval. A brief textual description of the target domain can be enough to drive adaptation.
The pipeline: (1) Provide a textual domain description. (2) Use instruction-following LLMs to extract domain properties: document topics, linguistic attributes, source characteristics, terminology patterns. (3) Generate seed documents matching those properties. (4) Iteratively retrieve documents that resemble the target domain from a corpus that is available, using the seed documents as query anchors. (5) Generate synthetic queries for the constructed collection. (6) Attach pseudo-relevance labels to the query-document pairs and fine-tune the retrieval model on them.
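Steps (4) to (6) of the pipeline above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the method's actual implementation: a bag-of-words cosine score stands in for a dense encoder, `stub_query_generator` is a hypothetical placeholder for an LLM call, and the function names, `rounds`, and `k` are invented for this example.

```python
import re
from collections import Counter
from math import sqrt
from typing import Callable


def bow(text: str) -> Counter:
    """Bag-of-words vector; a toy stand-in for a dense encoder (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_collection(seeds: list[str], corpus: list[str],
                     rounds: int = 2, k: int = 2) -> list[str]:
    """Step 4: grow a synthetic target collection from an accessible corpus.

    Each round keeps the k unselected documents closest to any anchor,
    then adds them to the anchor set for the next round.
    """
    anchors = list(seeds)
    selected: list[str] = []
    for _ in range(rounds):
        candidates = [d for d in corpus if d not in selected]
        if not candidates:
            break
        anchor_vecs = [bow(a) for a in anchors]
        ranked = sorted(
            candidates,
            key=lambda d: max((cosine(bow(d), v) for v in anchor_vecs), default=0.0),
            reverse=True,
        )
        newest = ranked[:k]
        selected.extend(newest)
        anchors.extend(newest)
    return selected


def stub_query_generator(doc: str) -> str:
    """Hypothetical placeholder for an LLM (step 5): first four tokens of the doc."""
    return " ".join(re.findall(r"[a-z0-9]+", doc.lower())[:4])


def make_training_pairs(
    collection: list[str],
    generate_query: Callable[[str], str] = stub_query_generator,
) -> list[tuple[str, str]]:
    """Step 6: the document a query was generated from is its pseudo-positive."""
    return [(generate_query(doc), doc) for doc in collection]


if __name__ == "__main__":
    seeds = ["Patient discharge summary noting hypertension medication dosage."]
    corpus = [
        "Discharge summary: patient prescribed medication for hypertension.",
        "Football match report: the home team won the final.",
        "Clinical note on patient dosage adjustments after surgery.",
        "Recipe for tomato soup with basil.",
    ]
    for query, doc in make_training_pairs(build_collection(seeds, corpus, rounds=2, k=1)):
        print(query, "->", doc)
```

The resulting (query, document) pairs are the pseudo-labeled data a fine-tuning step would consume. In a real setting the scorer would be the retrieval model itself and the query generator an instruction-following LLM.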
The retrieval-augmented approach to domain understanding is key: at step (2), the domain description itself becomes a RAG query to extract structured properties, which are then used to parameterize generation at step (3). The result is a bootstrap that runs from description, through synthesis, to training.
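One way to picture how extracted properties parameterize seed generation is the sketch below. It is a hedged illustration only: the `DomainProperties` container, the prompt wording, the JSON response format, and the `llm` callable are all assumptions introduced here. Only the four property types come from the pipeline description.

```python
import json
from dataclasses import dataclass
from typing import Callable

# The four property types named in step (2); the field names are illustrative.
FIELDS = ("topics", "linguistic_attributes",
          "source_characteristics", "terminology_patterns")


@dataclass
class DomainProperties:
    topics: list[str]
    linguistic_attributes: list[str]
    source_characteristics: list[str]
    terminology_patterns: list[str]


def extraction_prompt(description: str, passages: list[str]) -> str:
    """Build the step (2) prompt: description plus retrieved supporting passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Using the domain description and the supporting passages, return JSON "
        f"with the keys {', '.join(FIELDS)}, each a list of strings.\n"
        f"Description: {description}\n"
        f"Passages:\n{context}"
    )


def extract_properties(description: str, passages: list[str],
                       llm: Callable[[str], str]) -> DomainProperties:
    """Call the (hypothetical) LLM and parse its JSON reply into properties."""
    raw = json.loads(llm(extraction_prompt(description, passages)))
    return DomainProperties(**{name: list(raw.get(name, [])) for name in FIELDS})


def seed_prompt(props: DomainProperties) -> str:
    """Build the step (3) prompt: the extracted properties parameterize generation."""
    return (
        "Write one short document.\n"
        f"Topics: {', '.join(props.topics)}\n"
        f"Style: {', '.join(props.linguistic_attributes)}\n"
        f"Source type: {', '.join(props.source_characteristics)}\n"
        f"Terminology: {', '.join(props.terminology_patterns)}"
    )
```

Passing `llm` in as a plain callable keeps the sketch independent of any particular model provider, and a canned function returning a JSON string is enough to exercise it.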
Evaluation on five diverse target domains shows that description-based adaptation outperforms existing dense retrieval baselines in the zero-target-access scenario. The approach enables adaptation in precisely the contexts where conventional adaptation is blocked: privacy-sensitive domains, legally restricted data, competitive scenarios.
Inquiring lines that use this note as a source 37
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Can discrete codes and embedding injection both solve the text versus identity tradeoff?
- Can this distillation pattern apply beyond e-commerce to other latency-constrained domains?
- Do Doc2Query approaches suffer from the same misaligned-target problem?
- Can embedding tables be efficiently adapted per downstream domain?
- Why does text encoding create different subspaces across domains?
- Why do bi-encoder retrievers sacrifice effectiveness for latency in two-stage ranking?
- How does retrieval-augmented generation extract structured properties from domain descriptions?
- Why does capturing domain structure reduce data requirements more than raw volume?
- What access constraints allow description-based adaptation but block conventional techniques?
- What causes the retrieval-augmented generation to fail in practice?
- Why does domain-specific terminology require customization of vector search and generation?
- What makes web retrieval more effective than static knowledge bases?
- What makes retrieval augmentation more effective than simply increasing embedding size?
- How should query augmentation strategies be properly evaluated against baselines?
- What hidden costs might fine-tuning retrieval models introduce on out-of-distribution queries?
- Can semantic query expansion overcome vocabulary mismatch in corrupted text?
- Could eliminating retrieval entirely work better than shifting the burden?
- What documents improve answers beyond surface query similarity?
- How does retrieval-augmented generation create topically redundant content patterns?
- Can temporal ranking improve retrieval without modifying the underlying video model?
- Why does pure numeric ID indexing force models to learn from scratch?
- How can inference-time retrieval avoid the domain boundary problem?
- How does semantic mismatch between user language and API documentation degrade tool retrieval?
- Why do embedding-based retrieval systems fail on vocabulary mismatch?
- How do discrete item codes compare to text-based item indexing for transfer?
- How does retrieval-augmented training reduce domain specialization cliff failures?
- Can lookup tables transfer across domains better than text encoders?
- What design tradeoffs exist between pure ID and pure text indexing?
- Can Parfit's identity criteria apply to something that gets reconstituted from text data?
- How does reflection-based query refinement differ from single-pass retrieval strategies?
- How does description-based bridging compare to affordance-aware reranking for retrieval?
- Can the same description-then-retrieve pattern work for domain adaptation without target data?
- How do retrieval and fine-tuning trade off flexibility against training cost?
- Does retrieval quality depend more on access structure or write gating?
- Why does production retrieval augmented generation underperform in real deployments?
- What would instruction-following retrieval enable that query-only systems cannot?
- How much training data teaches retrieval models to follow instructions?
Related concepts in this collection 2
- Does model access level determine which specialization techniques work?
  Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.
  Description-based adaptation enables white-box-style performance from a grey/black-box access constraint: you describe the domain without sharing the data.
- Can organizing knowledge structures beat raw training data volume?
  Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.
  Both show that structured domain knowledge (taxonomy or description) dramatically reduces data requirements; the key is capturing domain structure, not data volume.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Dense Retrieval Adaptation using Target Domain Description
- On the Theoretical Limitations of Embedding-Based Retrieval
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models
- Generator-Retriever-Generator: A Novel Approach to Open-domain Question Answering
- Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models
- Chain-of-Retrieval Augmented Generation
Original note title
domain adaptation for retrieval is possible without target collection via description-based synthetic data