SYNTHESIS NOTE

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways: queries are short and interrogative, documents long and declarative. This mismatch is a key reason direct embedding retrieval often fails.

Synthesis note · 2026-02-22 · sourced from RAG

The standard embedding retrieval pipeline maps a query directly to a vector and finds nearby document vectors. This assumes that a query and a relevant document occupy nearby regions of the embedding space. They often do not. Queries are short, telegraphic, and interrogative. Relevant documents are long, detailed, and declarative. The same information expressed in query form and document form looks different to an encoder trained on natural language co-occurrence.

HyDE (Hypothetical Document Embeddings) decomposes retrieval into two steps that exploit this asymmetry. First, ask an instruction-following LLM to generate a hypothetical document that would answer the query: not a real document, but something that looks like one. Second, embed the hypothetical document and use document-document similarity to find real corpus matches. The encoder, trained on document-to-document similarity, now operates in its natural space.
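The two steps can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the HyDE implementation: `embed` is a toy bag-of-words stand-in for a dense encoder such as Contriever, `generate_hypothetical_document` is a hypothetical stub that returns a canned passage where a real system would call an instruction-following LLM, and `CORPUS` is invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(vector, corpus):
    """Return the corpus document whose embedding is closest to `vector`."""
    return max(corpus, key=lambda doc: cosine(vector, embed(doc)))


def generate_hypothetical_document(query):
    """Hypothetical stub: a real pipeline would prompt an LLM with `query`."""
    return ("The sky looks blue because Rayleigh scattering in the atmosphere "
            "affects shorter wavelengths of sunlight more strongly.")


# Invented corpus: the relevant passage is declarative and shares almost no
# words with the question; the distractor is phrased like a question.
CORPUS = [
    "Rayleigh scattering makes shorter wavelengths of sunlight scatter "
    "more strongly in the atmosphere.",
    "Why is the ocean salty? It is a question children often ask.",
    "Contrastive encoders are trained on pairs of text passages.",
]


def direct_retrieve(query, corpus):
    """Baseline: embed the query itself and search."""
    return nearest(embed(query), corpus)


def hyde_retrieve(query, corpus):
    """HyDE-style: embed a generated document instead of the query."""
    hypothetical = generate_hypothetical_document(query)
    return nearest(embed(hypothetical), corpus)


if __name__ == "__main__":
    question = "Why is the sky blue?"
    print("direct:", direct_retrieve(question, CORPUS))
    print("hyde:  ", hyde_retrieve(question, CORPUS))
```

With this toy encoder the question-shaped distractor wins the direct search because it shares surface form with the query, while the hypothetical passage lands next to the declarative passage that actually answers it. A real dense encoder is far less literal than word counts, so the effect there is one of degree.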

The generated document may be factually wrong: it is, in the FLARE framing, a hallucination on purpose. But factual accuracy is not the goal; the relevance pattern is. The hypothetical document "captures relevance by example": it demonstrates what a relevant document looks like in terms of style, terminology, and structure. The encoder's dense bottleneck filters out hallucinated details while preserving the embedding signature of relevant content.
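The note does not say how a single bad generation is kept from dominating the search vector. One common option, treated here as an assumption and not something this note documents, is to sample several hypothetical documents and average their embeddings, optionally including the query's own embedding. The sketch below shows only that averaging step on plain lists of floats; `hyde_query_vector` and `mean_vector` are hypothetical names.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]


def hyde_query_vector(hypothetical_vectors, query_vector=None):
    """Average the embeddings of several generated documents.

    If `query_vector` is given, it is included as one more sample, so the
    original query still contributes to the final search vector.
    """
    vectors = [list(v) for v in hypothetical_vectors]
    if query_vector is not None:
        vectors.append(list(query_vector))
    if not vectors:
        raise ValueError("need at least one vector to average")
    return mean_vector(vectors)
```

Averaging means that a detail hallucinated in only one sample shifts the final vector by a fraction of its original weight, while directions shared by all samples are kept.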

The implication is that the query is the wrong level of abstraction for retrieval. Queries work well when they are complete enough to uniquely identify relevant content, which is why direct query embedding succeeds on short-form factoid QA but fails on complex or underspecified requests. Hypothetical documents circumvent this by translating the query into the same genre as the targets.

The approach requires no relevance labels and no retrieval-specific fine-tuning, only an instruction-following LLM and an unsupervised contrastive encoder. On 11 query sets spanning web search, question answering, and fact verification, HyDE with InstructGPT and Contriever significantly outperforms Contriever used on its own, the zero-shot baseline that likewise uses no relevance labels.

Original note title

query-document vocabulary mismatch makes direct embedding retrieval suboptimal — hypothetical document bridging resolves it