INQUIRING LINE

Do Doc2Query approaches suffer from the same misaligned-target problem?

This explores whether Doc2Query—which expands documents by predicting the queries they'd answer—inherits the same flaw seen elsewhere in retrieval: optimizing for a proxy target (likely-looking queries) instead of the real target (actual relevance).


This reads the question as asking whether Doc2Query, which pads each document with machine-generated queries it might answer, falls into the same trap that haunts retrieval more broadly: it trains toward a stand-in goal rather than the thing you actually want. The corpus doesn't have a note on Doc2Query by name, but it maps the surrounding territory sharply enough to answer by analogy. The answer is largely yes, with one escape hatch: train the retriever itself instead of generating more text.

The root problem the question names shows up most clearly in "Where do retrieval systems fail and why?", which argues that embedding-based retrieval fails structurally because embeddings measure *association*, not *relevance*: two different targets that only sometimes line up. Doc2Query is one response to that mismatch. Instead of changing how you match, you pre-write plausible queries onto the document at indexing time so the surface vocabulary overlaps. But notice the sleight of hand: you've now made the document look like the queries a model *predicts*, not the queries real users *ask*. That's the misaligned-target problem moved one step upstream, from the matcher to the generator.
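The expansion step described above can be sketched in a few lines. This is a toy illustration, not Doc2Query itself: `generate_queries` is a hypothetical stub standing in for a trained query generator, the term-overlap score is a crude stand-in for a lexical matcher such as BM25, and the example document and queries are invented.

```python
import re

def tokens(text):
    """Lowercased word set; a crude stand-in for a lexical matcher's term view."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap(query, doc):
    """Number of distinct query terms that also appear in the document."""
    return len(tokens(query) & tokens(doc))

def generate_queries(doc):
    """Hypothetical stub for a trained query generator (hard-coded here)."""
    return ["what causes engine knocking", "how to fix engine knock"]

def expand(doc):
    """Doc2Query-style expansion: append the predicted queries to the document."""
    return doc + " " + " ".join(generate_queries(doc))

doc = "Premature detonation of the fuel-air mixture damages pistons over time."
predicted_style = "what causes engine knocking"        # phrased like the generator's output
user_style = "pinging noise when accelerating uphill"  # phrased like a real user (assumed)

expanded = expand(doc)
print(overlap(predicted_style, doc), overlap(predicted_style, expanded))  # 0 4
print(overlap(user_style, doc), overlap(user_style, expanded))            # 0 0
```

In this toy case, expansion helps only the query that resembles what the generator predicted. The user-style query gains nothing, which is the upstream misalignment in miniature.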

The cleanest sibling here is HyDE, covered in "Why do queries and documents occupy different embedding spaces?". HyDE is Doc2Query's mirror image: where Doc2Query expands documents toward hypothetical queries, HyDE expands queries toward hypothetical documents, then matches document-to-document. Both bet that a generated bridge beats a direct query-document comparison, and both inherit the risk that the generated text drifts toward what's *plausible* rather than what's *correct*. The note "Do frontier LLMs silently corrupt documents in long workflows?" records the same kind of drift in a different setting: models confidently produce content that's subtly off-target, and the error doesn't announce itself.
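The mirror-image relationship can be sketched the same way. This is a minimal illustration under loud assumptions: `generate_hypothetical_document` is a hypothetical stub in place of an LLM call, bag-of-words cosine stands in for the dense embeddings HyDE actually uses, and the corpus is invented.

```python
import math
import re
from collections import Counter

def bow(text):
    """Bag-of-words vector; an assumed stand-in for a dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def generate_hypothetical_document(query):
    """Hypothetical stub for an LLM that drafts a plausible answer document."""
    return "Engine knocking is caused by premature detonation of the fuel mixture."

corpus = [
    "Premature detonation of the fuel-air mixture damages pistons over time.",
    "Rotate tires every ten thousand kilometres to even out tread wear.",
]
query = "why is my engine knocking"
hypothetical = generate_hypothetical_document(query)

# Direct query-to-document matching versus matching through the generated bridge.
direct = [cosine(bow(query), bow(d)) for d in corpus]
bridged = [cosine(bow(hypothetical), bow(d)) for d in corpus]
best = max(range(len(corpus)), key=lambda i: bridged[i])
print(direct, best)  # [0.0, 0.0] 0
```

The bridge works here only because the stubbed answer happens to be right. If the generator had drafted a fluent but wrong explanation, the same code would retrieve confidently toward the wrong document.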

The most pointed challenge to Doc2Query's whole premise comes from "Can fine-tuning replace query augmentation for retrieval?". Its claim is that if you fine-tune the retriever on implicit queries, it learns to resolve ambiguity internally, so you never need to bolt generated queries onto documents at all. In that framing, Doc2Query is a workaround for a weak retriever, and a workaround that introduces its own target-misalignment is worse than fixing the retriever directly. "Can you adapt retrieval models without accessing target data?" pushes the same way: synthetic training signal can adapt a model well when it's grounded in a real domain description, suggesting the fix is better *training* targets, not more *generated* surface text.

So the thing you might not have known you wanted to know: the field has quietly split between two ways to close the query-document gap. One is to generate a textual bridge (Doc2Query at indexing time, HyDE at query time); the other is to move the gap into the model's weights via training (fine-tuned retrieval). The generative bridge is cheap to bolt on, and in HyDE's case needs no relevance labels, but it always risks optimizing for a plausible-looking proxy; the training route is more expensive but targets relevance more directly. Doc2Query suffers the misaligned-target problem precisely *because* it chose the bridge.
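One rough way to make the "plausible-looking proxy" risk visible is to check how much of the vocabulary in real logged queries the generated queries actually cover. The sketch below is an assumed diagnostic, not something the source notes describe; the function names and both query lists are invented for illustration.

```python
import re

def vocab(queries):
    """Distinct lowercase terms across a list of queries."""
    terms = set()
    for q in queries:
        terms.update(re.findall(r"[a-z0-9]+", q.lower()))
    return terms

def coverage(generated, logged):
    """Fraction of distinct logged-query terms that also occur in generated queries."""
    real = vocab(logged)
    if not real:
        return 0.0
    return len(real & vocab(generated)) / len(real)

# Hypothetical generator output for one document, and hypothetical user logs.
generated = ["what causes engine knocking", "how to fix engine knock"]
logged = ["engine pinging noise uphill", "car knocking sound when accelerating"]

print(round(coverage(generated, logged), 2))  # 0.22
```

A low score doesn't prove the expansion is useless, since a dense retriever may still bridge synonyms, but it is a cheap signal that the generator is writing for its own distribution rather than the users'.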


Sources (5 notes)

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Why do queries and documents occupy different embedding spaces?

HyDE resolves retrieval failures by generating plausible answer documents first, then matching those documents to the corpus using document-document similarity. This avoids the mismatch between query and document spaces without requiring labeled training data.

Can fine-tuning replace query augmentation for retrieval?

Fine-tuned semantic search models trained on implicit queries match the performance of augmented pretrained retrievers without expanding input length. The model learns to resolve ambiguity through training rather than requiring explicit augmentation.

Can you adapt retrieval models without accessing target data?

Research demonstrates that a brief textual domain description suffices to generate synthetic training data for retrieval fine-tuning, outperforming baselines in zero-target-access scenarios and enabling adaptation where conventional methods are blocked.

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a retrieval researcher re-evaluating the query-document alignment problem in 2025+. The question remains open: does Doc2Query (augmenting documents with machine-generated queries) escape the misaligned-target trap, or does it simply displace it?

What a curated library found — and when (dated claims, not current truth): Findings span 2022–2026.
• Doc2Query inherits a fundamental misalignment: it optimizes for *plausible-looking queries* (what the model predicts) rather than *user queries* (what's actually asked), shifting the target-mismatch problem upstream (2022–2023).
• HyDE mirrors this risk by generating hypothetical documents instead; both generative-bridge approaches risk drifting toward proxy optima rather than true relevance (2023).
• Fine-tuned retrieval (learning relevance directly via training) outperforms synthetic query augmentation and eliminates the need for Doc2Query altogether (2023–2024).
• Recent work (2025–2026) shows reasoning-augmented retrieval (chain-of-thought, adaptive reasoning, continuous latent reasoning) can resolve query-document gaps *within* model weights, further marginalizing generated-text augmentation (arXiv:2501.14342, 2508.06105, 2511.18659).
• LLMs systematically corrupt document content when delegated to generate or transform text, introducing silent failures that compound Doc2Query's plausibility drift (2026, arXiv:2604.15597).

Anchor papers (verify; mind their dates):
• arXiv:2212.10496 (2022) – Zero-shot dense retrieval without labels; establishes the baseline embedding-retrieval misalignment.
• arXiv:2307.02740 (2023) – Target-domain adaptation via description; shows training-based fixes outperform surface augmentation.
• arXiv:2501.14342 (2025) – Chain-of-Retrieval; demonstrates reasoning-in-weights supersedes generated bridges.
• arXiv:2604.15597 (2026) – LLM document corruption; flags silent degradation in any generative pipeline.

Your task:
(1) RE-TEST THE MISALIGNMENT CLAIM. Judge whether (a) newer retrieval models trained on reasoning signals (CoT, process supervision) have absorbed the query-document gap so thoroughly that Doc2Query now *degrades* performance; (b) orchestration advances (adaptive retrieval, self-correction loops, verifiable generation) have made Doc2Query's plausibility drift detectable and correctable; (c) synthetic data quality (via constitutional AI, grading, filtering) has improved enough to make generated queries *reliable* proxies. For each, cite what resolved (or failed to resolve) the constraint.
(2) Surface the strongest work from the last 6 months that *contradicts* Doc2Query's premise or shows a superior alternative (reasoning, training, or hybrid).
(3) Propose 2 open questions assuming the regime *has* shifted: (a) Can Doc2Query be salvaged if the generator is supervised to produce *user-distribution* queries rather than generic plausible ones? (b) Does combining Doc2Query with continuous latent reasoning (rather than discrete augmented text) retain the cheapness while eliminating the plausibility trap?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
