Can LLMs recommend items without seeing the product catalog?
This explores whether an LLM can suggest products it has never been shown — recommending without direct access to the item catalog — and what tricks make that possible.
The corpus says yes, but the interesting part is *how* the model pulls this off. The cleanest demonstration is Rec-R1, where an LLM trained with reinforcement learning receives only the recommender system's performance metrics as a reward signal and never sees the inventory. Over time it learns to write effective product-search queries anyway, picking up an implicit sense of what is in the catalog through feedback alone, much the way you learn to phrase Amazon searches without ever reading the full product list (see: Can LLMs recommend products without ever seeing the catalog?).
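The feedback-only idea can be illustrated with a toy loop. This is a minimal sketch and not Rec-R1's training procedure: a simple epsilon-greedy bandit over a fixed list of candidate queries stands in for RL fine-tuning of an LLM, and the catalog, relevance labels, candidate queries, and function names are all hypothetical. The point it shows is that the learner only ever observes a scalar metric (NDCG@k here), never the catalog itself.

```python
import math
import random

# Hypothetical hidden catalog: the learner below never reads this directly.
_HIDDEN_CATALOG = {
    "p1": "waterproof trail running shoes",
    "p2": "leather office shoes",
    "p3": "trail running hydration vest",
    "p4": "waterproof hiking boots",
}
# Assumed relevance labels for one user need (wet-weather footwear).
_RELEVANT = {"p1", "p4"}


def black_box_reward(query: str, k: int = 2) -> float:
    """Search the hidden catalog by term overlap and return only NDCG@k."""
    terms = set(query.split())
    ranked = sorted(
        _HIDDEN_CATALOG,
        key=lambda pid: (-len(terms & set(_HIDDEN_CATALOG[pid].split())), pid),
    )
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, pid in enumerate(ranked[:k])
        if pid in _RELEVANT
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(_RELEVANT))))
    return dcg / ideal


def learn_query(candidates, steps=50, epsilon=0.2, seed=0):
    """Epsilon-greedy bandit: pick the query with the best average reward."""
    rng = random.Random(seed)
    totals = {q: 0.0 for q in candidates}
    counts = {q: 0 for q in candidates}

    def mean(q):
        return totals[q] / max(counts[q], 1)

    for step in range(steps):
        if step < len(candidates):
            query = candidates[step]          # try every candidate once
        elif rng.random() < epsilon:
            query = rng.choice(candidates)    # explore
        else:
            query = max(candidates, key=mean)  # exploit
        totals[query] += black_box_reward(query)
        counts[query] += 1
    return max(candidates, key=mean)


CANDIDATES = ["shoes", "running shoes", "waterproof shoes boots", "office shoes"]
best_query = learn_query(CANDIDATES)
print(best_query)
```

In a real setup the candidate list would be replaced by queries sampled from the model, and the reward would update model weights rather than a lookup table.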
But the corpus also gently pushes back on the premise that the LLM should be the one doing the recommending at all. One striking finding is that LLMs are often more valuable when they enrich the *inputs* (paraphrasing, summarizing, and categorizing item descriptions so a traditional recommender can rank them) than when asked to produce recommendations directly. The reason is telling: LLMs are great at understanding content but lack the specialized ranking instincts a dedicated recommender has (see: Does LLM input augmentation beat direct LLM recommendation?). That reframes the question: maybe "recommend without the catalog" works best when the LLM never tries to know the catalog, and instead feeds its language understanding into a system that does.
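A small sketch of the input-augmentation pattern, under stated assumptions: `enrich` is a hypothetical stand-in for an LLM call and returns hard-coded text, and the "traditional recommender" is reduced to a bag-of-words cosine ranker for brevity. The item names and descriptions are made up for illustration.

```python
import math
from collections import Counter


def enrich(description: str) -> str:
    """Hypothetical stand-in for an LLM that adds paraphrases and categories.

    A real system would prompt a model; the outputs here are hard-coded.
    """
    canned = {
        "trail running shoes": "sneakers for off-road jogging; category: footwear",
        "cast iron skillet": "heavy frying pan for searing; category: cookware",
    }
    return f"{description} {canned.get(description, '')}".strip()


def bow(text: str) -> Counter:
    """Lowercased bag-of-words with light punctuation stripping."""
    return Counter(text.lower().replace(";", " ").replace(":", " ").split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str, item_texts: dict) -> list:
    """Conventional ranker: order item ids by similarity to the query."""
    q = bow(query)
    return sorted(
        item_texts,
        key=lambda item_id: cosine(q, bow(item_texts[item_id])),
        reverse=True,
    )


raw_items = {"shoe": "trail running shoes", "pan": "cast iron skillet"}
enriched_items = {item_id: enrich(text) for item_id, text in raw_items.items()}

# The LLM only touches the item text; ranking stays with the classic scorer.
print(rank("sneakers", enriched_items))
```

With the raw descriptions the query "sneakers" shares no terms with any item, so the ranker has nothing to work with; after enrichment the same ranker can match it, which is the division of labor the paragraph above describes.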
This tension runs through the integration paradigms the corpus maps out. There are essentially three ways to plug an LLM in: feed its embeddings to a traditional recommender, have it emit semantic tokens, or let it recommend directly (see: How should language models integrate into recommender systems?). When you *do* want the LLM closer to a large catalog, RecLLM lays out four distinct retrieval strategies (dual-encoder, direct LLM search, concept-based, and search-API lookup), each tuned to different corpus sizes and latency budgets (see: How should LLM-based recommenders retrieve from massive item corpora?). So "without seeing the catalog" is really a spectrum: from generating a query that a search system resolves, to hybrid setups like CoLLM that inject collaborative-filtering signals into the LLM's token space so it gains catalog-aware strength without losing its text understanding (see: Can LLMs gain collaborative filtering strength without losing text understanding?).
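The CoLLM-style end of that spectrum can be sketched roughly as follows. This is an illustration under assumptions, not CoLLM's actual architecture: a single linear projection is assumed as the mapping, the dimensions and weights are arbitrary, and `project` and `build_inputs` are hypothetical names. It shows only the shape of the idea: collaborative-filtering vectors are mapped to the same width as token embeddings and placed in the input sequence next to the text.

```python
def project(cf_vector, weight):
    """Map a CF vector (length d_cf) to token-embedding width (length d_model).

    `weight` is a d_model x d_cf matrix given as a list of rows. A learned
    mapping would be trained; here the values are fixed for illustration.
    """
    return [sum(w * x for w, x in zip(row, cf_vector)) for row in weight]


def build_inputs(text_embeddings, user_cf, item_cf, weight):
    """Append projected user and item CF vectors to the text token embeddings."""
    return text_embeddings + [project(user_cf, weight), project(item_cf, weight)]


# Toy sizes: d_cf = 2, d_model = 3.
WEIGHT = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]
prompt_embeddings = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]  # two text tokens
sequence = build_inputs(prompt_embeddings, [1.0, 2.0], [0.0, 1.0], WEIGHT)
print(len(sequence), len(sequence[-1]))
```

Because the extra vectors have the same width as ordinary token embeddings, the language model can attend to them alongside the text without changes to the model itself.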
The catch worth knowing is what the LLM smuggles in when it recommends blind. Without grounding in real inventory, it leans on patterns absorbed during pretraining, and those carry position, popularity, and fairness biases that don't come from any interaction data (see: Where do recommendation biases come from in language models?). It may also confidently explain its picks using criteria that don't match how it actually chose them, a post-hoc justification rather than a true account (see: Do LLM explanations faithfully describe their recommendation process?). One way to keep the fluency while regaining grounding: distill the LLM's knowledge offline into a product knowledge graph, so production systems serve catalog-accurate, low-latency recommendations with the LLM's insight baked in but its hallucinations pruned out (see: Can we distill LLM knowledge into graphs for real-time recommendations?).
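The offline-distillation idea can be sketched as below. Everything here is assumed for illustration: `llm_propose_edges` is a hypothetical stand-in for offline LLM output, the catalog and relation name are made up, and pruning is reduced to a simple catalog-membership check, whereas the corpus describes more rigorous evaluation before the graph is populated.

```python
from collections import defaultdict

# Hypothetical catalog of items the store actually sells.
CATALOG = {"tent", "sleeping bag", "camp stove"}


def llm_propose_edges():
    """Hypothetical stand-in for triples an LLM generates in an offline batch job."""
    return [
        ("tent", "complements", "sleeping bag"),
        ("tent", "complements", "camp stove"),
        ("tent", "complements", "portable espresso drone"),  # not in the catalog
    ]


def build_graph(edges, catalog):
    """Keep only edges whose endpoints both exist in the catalog."""
    graph = defaultdict(list)
    for head, relation, tail in edges:
        if head in catalog and tail in catalog:
            graph[head].append((relation, tail))
    return dict(graph)


def recommend(graph, item, relation="complements"):
    """Serve-time lookup: a dictionary read, with no LLM call."""
    return [tail for rel, tail in graph.get(item, []) if rel == relation]


GRAPH = build_graph(llm_propose_edges(), CATALOG)
print(recommend(GRAPH, "tent"))
```

The split matters for latency: the expensive, error-prone generation happens once offline, and the serving path only touches the pruned graph.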
So the honest answer is: an LLM can absolutely recommend without holding the catalog in front of it — through learned query-writing, retrieval, or injected signals — but the best results tend to come from *not* asking it to be the catalog-keeper, and instead pairing its language sense with a system that knows what's actually on the shelves.
Sources (8 notes)
Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.
Using LLMs to augment item descriptions with paraphrases, summaries, and categories—then feeding enriched text to traditional recommenders—beats asking LLMs to recommend directly. The mechanism: LLMs excel at content understanding but lack specialized ranking bias, so their textual enrichment is more valuable than their predictions.
Research identifies three patterns: LLM embeddings feeding traditional recommenders, LLM-generated semantic tokens for decision-making, and direct LLM-as-recommender. Each trades off compatibility, latency, bias exposure, and capability utilization differently.
RecLLM identifies four retrieval patterns—dual-encoder, direct LLM search, concept-based, and search-API lookup—each optimized for different corpus sizes, latency budgets, and training constraints. Hybrid approaches mixing multiple strategies likely work best for real systems.
CoLLM maps traditional collaborative filtering embeddings into the LLM's input token space, letting the LLM attend to CF signals alongside text without modification. This hybrid architecture maintains semantic understanding for cold items while gaining collaborative strength for warm interactions.
Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.
LLMs use additive utilitarian aggregation to generate group recommendations but explain the process using undefined popularity, similarity, and diversity metrics that don't match their actual behavior. Explanations become increasingly elaborate as item sets grow, suggesting post-hoc justification rather than truthful disclosure.
By distilling LLM knowledge into a product knowledge graph at offline time, systems can serve real-time recommendations with LLM-quality insights while meeting strict latency constraints. Rigorous evaluation and pruning mitigate hallucination risks before graph population.