Do Doc2Query approaches suffer from the same misaligned-target problem?
This explores whether Doc2Query—which expands documents by predicting the queries they'd answer—inherits the same flaw seen elsewhere in retrieval: optimizing for a proxy target (likely-looking queries) instead of the real target (actual relevance).
This reads the question as asking whether Doc2Query, which fattens each document with machine-generated queries it might answer, falls into the same trap that haunts retrieval more broadly: it trains toward a stand-in goal rather than the thing you actually want. The corpus doesn't have a note on Doc2Query by name, but it maps the surrounding territory sharply enough to answer by analogy—and the answer is largely yes, with an interesting escape hatch.
The root problem the question names shows up most clearly in Where do retrieval systems fail and why?, which argues that embedding-based retrieval fails structurally because embeddings measure *association*, not *relevance*—two different targets that only sometimes line up. Doc2Query is one response to that mismatch: instead of changing how you match, you pre-write plausible queries onto the document so the surface vocabulary overlaps. But notice the sleight of hand—you've now made the document look like the queries a model *predicts*, not the queries real users *ask*. That's the misaligned-target problem moved one step upstream, from the matcher to the generator.
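A toy sketch can make the upstream shift concrete. It assumes a crude lexical matcher that counts shared terms, and it hard-codes `predicted_queries` as a hypothetical stand-in for what a trained query generator might emit. The document text and both test queries are invented for illustration, not taken from any note.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for a lexical index."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, document):
    """Count distinct query terms that also appear in the document."""
    return len(tokens(query) & tokens(document))

def expand(document, predicted_queries):
    """Doc2Query-style expansion: append generated queries to the document text."""
    return document + " " + " ".join(predicted_queries)

document = "Resetting the router restores factory settings and clears saved networks."

# Hypothetical generator output; a real system would sample these from a trained model.
predicted_queries = ["how do I reset my router", "what does a factory reset clear"]

expanded = expand(document, predicted_queries)

# One query phrased the way the generator expects, one phrased the way a user might.
predicted_style = "how to reset router"
real_user_style = "wifi box lost password"

gain_predicted = overlap(predicted_style, expanded) - overlap(predicted_style, document)
gain_real = overlap(real_user_style, expanded) - overlap(real_user_style, document)

print(gain_predicted, gain_real)
```

In this contrived case the expansion helps only the query that shares the generator's vocabulary. The user-style query gains nothing, which is the proxy-versus-target gap in miniature.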
The cleanest sibling here is HyDE in Why do queries and documents occupy different embedding spaces?. HyDE is Doc2Query's mirror image: where Doc2Query expands documents toward hypothetical queries, HyDE expands queries toward hypothetical documents, then matches document-to-document. Both bet that a generated bridge beats a direct query-document comparison, and both inherit the risk that the generated text drifts toward what's *plausible* rather than what's *correct*. That drift resembles the failure that Do frontier LLMs silently corrupt documents in long workflows? documents in a different setting: models confidently produce content that's subtly off-target, and the error doesn't announce itself.
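The HyDE side of the mirror can be sketched the same way. This is a minimal illustration, assuming bag-of-words cosine similarity in place of a real embedding model; the two-document corpus, the query, and both generated answers (`faithful` and `drifted`) are hypothetical and written by hand rather than produced by an LLM.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector; a stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical two-document corpus.
corpus = {
    "aspirin_note": "Aspirin inhibits platelet aggregation and lowers the risk of clots.",
    "ibuprofen_note": "Ibuprofen reduces inflammation and relieves headache pain quickly.",
}

def best_match(text):
    """Return the corpus key most similar to the given text."""
    return max(corpus, key=lambda key: cosine(bow(text), bow(corpus[key])))

query = "which pill thins blood"

# Hand-written stand-ins for generated hypothetical documents.
faithful = "Aspirin thins blood because it inhibits platelet aggregation."
drifted = "Ibuprofen is a common pill that relieves pain and reduces inflammation."

direct_score = cosine(bow(query), bow(corpus["aspirin_note"]))
print(direct_score, best_match(faithful), best_match(drifted))
```

Here the raw query shares no terms with the relevant note, the faithful hypothetical document bridges to it, and the plausible but wrong one bridges just as confidently to the other note. Nothing in the similarity score signals which of the two bridges was built.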
The most pointed challenge to Doc2Query's whole premise comes from Can fine-tuning replace query augmentation for retrieval?. Its claim is that a retriever fine-tuned on implicit queries learns to resolve ambiguity internally and matches the performance of augmented pretrained retrievers. That note is about augmenting the query side, but the same logic extends to the document side: if training can absorb the gap, you may not need to bolt generated queries onto documents at all. In that framing, Doc2Query is a workaround for a weak retriever, and a workaround that introduces its own target misalignment looks less attractive than fixing the retriever directly. Can you adapt retrieval models without accessing target data? pushes the same way: synthetic training signal can adapt a model well when it's grounded in a real domain description, suggesting the fix is better *training* targets, not more *generated* surface text.
So the thing you might not have known you wanted to know is that the field has quietly split between two ways to close the query-document gap: generate a bridge as text (Doc2Query at indexing time, HyDE at query time) or move the gap into the model's weights via training (fine-tuned retrieval). The generative bridge is easy to bolt on and can be label-free (HyDE needs no labeled training data), but it always risks optimizing for a plausible-looking proxy; the training route costs more up front but targets relevance more directly. Doc2Query suffers the misaligned-target problem precisely *because* it chose the bridge.
Sources (5 notes)
Where do retrieval systems fail and why? RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.
Why do queries and documents occupy different embedding spaces? HyDE resolves retrieval failures by generating plausible answer documents first, then matching those documents to the corpus using document-document similarity. This avoids the mismatch between query and document spaces without requiring labeled training data.
Can fine-tuning replace query augmentation for retrieval? Fine-tuned semantic search models trained on implicit queries match the performance of augmented pretrained retrievers without expanding input length. The model learns to resolve ambiguity through training rather than requiring explicit augmentation.
Can you adapt retrieval models without accessing target data? Research demonstrates that a brief textual domain description suffices to generate synthetic training data for retrieval fine-tuning, outperforming baselines in zero-target-access scenarios and enabling adaptation where conventional methods are blocked.
Do frontier LLMs silently corrupt documents in long workflows? Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.