How can gradients flow through discrete document selection?
This reads the question as the classic differentiability problem: picking a document is a hard, discrete choice (an argmax or top-k) that blocks gradients, and the question is how systems learn through that step. One framing caveat up front: the corpus addresses it less through gradient-estimator tricks and more through architectural decoupling.
This explores how a model can still be trained end-to-end when one step in the middle — choosing which document or item to pull — is a hard, discrete pick that ordinary backpropagation can't pass through. The honest answer first: this collection doesn't contain a paper that directly works the gradient-estimator angle (the Gumbel-softmax / straight-through tradition that makes the discrete choice 'soft' enough for gradients to leak through). What it does offer is the more durable design insight — that the cleanest way to get gradients past a discrete selection is often to move the learning somewhere the discreteness no longer blocks it.
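For orientation only, since the collection itself does not cover it, here is a minimal sketch of the Gumbel-softmax relaxation named above. The scores, the temperature, and the function name `gumbel_softmax` are illustrative assumptions, not anything taken from the notes.

```python
import math
import random

def gumbel_softmax(scores, temperature, rng):
    """Return soft selection weights over documents (a relaxed one-hot)."""
    noisy = []
    for s in scores:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # clamp U away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((s - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed scores.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
scores = [2.0, 0.5, -1.0]  # hypothetical retriever scores for three documents
soft_weights = gumbel_softmax(scores, temperature=0.5, rng=rng)

# A straight-through variant would use this hard pick in the forward pass
# while taking gradients with respect to soft_weights in the backward pass.
hard_pick = max(range(len(scores)), key=lambda i: soft_weights[i])
```

Lower temperatures push `soft_weights` toward a one-hot vector, which tracks the hard choice more closely but typically makes the gradient signal noisier.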
The sharpest example is the discrete-codes line of work in recommendation. VQ-Rec maps an item's text to discrete codes via product quantization, and those codes then index a table of *learnable* embeddings (notes: Can discrete codes transfer better than text embeddings?; Can discretizing text embeddings improve recommendation transfer?). The trick worth noticing: the model never backpropagates through the code assignment itself. The discrete step is a frozen lookup, and the gradient flows into the embedding table that the codes point at. So 'how do gradients flow through discrete selection?' gets answered by sidestepping: make the discrete part a fixed index, and put all the trainable weight on either side of it. That same decoupling is what gives the model its cross-domain transfer, because the lookup table can be re-fit per domain without disturbing the discrete code structure.
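A toy sketch of that sidestep, in plain Python with hand-written gradients. This is not VQ-Rec's implementation: the codebooks, the sum pooling, the squared-error loss, and every name here are assumptions chosen to keep the example small.

```python
def assign_codes(text_vec, codebooks):
    """Frozen product quantization: nearest centroid per sub-vector (a hard argmin)."""
    sub_len = len(text_vec) // len(codebooks)
    codes = []
    for g, centroids in enumerate(codebooks):
        sub = text_vec[g * sub_len:(g + 1) * sub_len]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in centroids]
        codes.append(dists.index(min(dists)))
    return codes

def item_embedding(codes, tables):
    """Sum the learnable rows that the codes point at (sum pooling is an assumption)."""
    dim = len(tables[0][0])
    out = [0.0] * dim
    for g, k in enumerate(codes):
        for d in range(dim):
            out[d] += tables[g][k][d]
    return out

def sgd_step(codes, tables, target, lr):
    """One step on loss = 0.5 * ||emb - target||^2; only the selected rows are updated."""
    emb = item_embedding(codes, tables)
    grad = [e - t for e, t in zip(emb, target)]  # dLoss/dEmb
    for g, k in enumerate(codes):
        for d in range(len(grad)):
            tables[g][k][d] -= lr * grad[d]      # dEmb/dRow is the identity for a sum
    return 0.5 * sum(x * x for x in grad)        # loss before this update

# Hypothetical toy setup: 4-dim text vector, 2 sub-vectors, 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.0, 1.0]],
]
tables = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
codes = assign_codes([0.9, 1.1, 0.1, -0.2], codebooks)  # computed once, then frozen
target = [1.0, -1.0, 0.5]
losses = [sgd_step(codes, tables, target, lr=0.1) for _ in range(50)]
```

The loss falls even though `assign_codes` is never differentiated: the argmin only decides which rows receive the update, and the rows that were not selected stay at their initial values.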
A second pattern in the corpus handles the problem by not trying to differentiate the selection at all, and instead splits recall from judgment. A two-stage retrieval pipeline does a cheap, non-differentiable first pass (pooled-cosine recall) and then trains a small Transformer verifier on the full token-to-token similarity map of the surviving candidates (note: Can verification separate structural near-misses from topical matches?). The learning lives entirely in the verifier, which sees rich continuous signal; the hard selection upstream stays a plumbing decision, not something the loss has to reach through. This is the same move as the discrete-codes case, generalized: isolate the discrete choice so the trainable component always operates on continuous representations.
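The split can be sketched as two plain functions. This is an illustration under assumptions: the token vectors, the mean pooling, and the names `recall` and `similarity_map` are hypothetical, and the Transformer verifier that would consume the maps is left out.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pool(tokens):
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]

def recall(query_tokens, docs, k):
    """Stage 1: hard top-k by pooled cosine. No gradient is meant to pass through this."""
    q = mean_pool(query_tokens)
    ranked = sorted(docs, key=lambda name: cosine(q, mean_pool(docs[name])), reverse=True)
    return ranked[:k]

def similarity_map(query_tokens, doc_tokens):
    """Stage 2 input: the full token-to-token cosine matrix, a continuous feature map."""
    return [[cosine(qt, dt) for dt in doc_tokens] for qt in query_tokens]

# Hypothetical token embeddings: "a" matches the query, "b" has the same tokens reordered.
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],
    "b": [[0.0, 1.0], [1.0, 0.0]],
    "c": [[-1.0, 0.0], [0.0, -1.0]],
}
candidates = recall(query, docs, k=2)
maps = {name: similarity_map(query, docs[name]) for name in candidates}
```

In this toy data "a" and "b" pool to the same vector, so recall cannot tell them apart, while their similarity maps differ. That difference is the continuous signal a trained verifier would learn from.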
Worth pulling in laterally: conditioning *what* gets selected on a global view, rather than learning the selection through gradients, is another route the corpus takes. MiA-RAG summarizes a document first and conditions retrieval on that summary, recovering structure that surface-similarity selection destroys (note: Can building a document map first improve retrieval over long texts?). There is also a cautionary note hovering over the whole question: LLMs turn out to fake iterative numerical procedures in latent space rather than actually executing them (note: Do large language models actually perform iterative optimization?). That is a reminder that 'the gradient appears to be flowing and optimizing' is not the same as the model genuinely learning the discrete structure you hoped it would.
The thing you might not have expected to learn: the field's most transferable answer to 'flow gradients through a discrete choice' is frequently *don't* — quantize the choice into a fixed index, hang learnable embeddings off it, and let the gradient train those instead. The discreteness becomes a feature (cleaner transfer, less text-similarity bias) rather than an obstacle to route around.
Sources (5 notes)
VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.
VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.
A two-stage pipeline—pooled-cosine recall followed by a small Transformer verifier operating on token-token similarity maps—reliably rejects structural near-misses that MaxSim-style late interaction cannot. The verifier succeeds because it operates on full token interaction patterns rather than compressed vectors.
MiA-RAG inverts standard RAG by summarizing documents first, then conditioning retrieval on that global view. This approach recovers discourse structure that bag-of-chunks retrieval destroys, making scattered evidence findable by its role in the document rather than by surface similarity alone.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.