Where do retrieval systems fail and why?

Examines how and why retrieval layers in RAG systems fail, from embedding limits to architectural mismatches.

Topic Hub · 28 linked notes · 9 sections
Retrieval Mechanics

6 notes

When should retrieval happen during model generation?

Explores whether retrieval should occur continuously, at fixed intervals, or only when the model signals uncertainty. Standard RAG retrieves once; long-form generation requires dynamic triggering based on confidence signals.

Can simple uncertainty estimates beat complex adaptive retrieval?

Does estimating a language model's confidence directly from its token probabilities outperform expensive multi-call adaptive retrieval pipelines? This matters because it could simplify RAG systems while reducing computational overhead.
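
As a rough illustration of the "simple uncertainty estimate" side of this question, the sketch below triggers retrieval when the geometric-mean token probability of a draft answer falls below a threshold. The function names, the 0.6 threshold, and the sample log-probabilities are assumptions made for illustration; they are not taken from the linked note or from any particular library.

```python
import math
from typing import Sequence


def mean_token_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean token probability of a draft answer, in [0, 1]."""
    if not token_logprobs:
        # No evidence of confidence at all: treat as maximally uncertain.
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_retrieve(token_logprobs: Sequence[float], threshold: float = 0.6) -> bool:
    """Hypothetical trigger: retrieve only when the draft looks uncertain."""
    return mean_token_confidence(token_logprobs) < threshold


# Made-up per-token log-probabilities for two draft answers.
confident_draft = [math.log(0.95), math.log(0.90), math.log(0.92)]
uncertain_draft = [math.log(0.90), math.log(0.20), math.log(0.35)]

print(should_retrieve(confident_draft))  # high average confidence
print(should_retrieve(uncertain_draft))  # low average confidence
```

A single-pass check like this costs one extra reduction over log-probabilities the model already produced, which is the overhead argument the note raises against multi-call pipelines.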

Why do queries and documents occupy different embedding spaces?

Queries and documents express the same information in fundamentally different ways: queries are short and interrogative, while documents are long and declarative. This mismatch helps explain why direct embedding retrieval often fails.

Can fine-tuning replace query augmentation for retrieval?

Query augmentation helps retrievers handle ambiguous queries but increases input cost. Does fine-tuning the retrieval model achieve comparable performance without this overhead?

Can long-context models resolve retriever-reader imbalance?

Traditional RAG systems forced retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change which architecture makes sense?

Can a model's partial response guide what to retrieve next?

Does using the model's in-progress output as a retrieval signal reveal information needs better than the original query alone? This explores whether generation itself can diagnose what documents are missing.

Failure Modes

5 notes

Do vector embeddings actually measure task relevance?

Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge—as with king/queen versus king/ruler—does similarity-based retrieval fail in production?

Can long-context LLMs replace retrieval-augmented generation systems?

Explores whether loading entire corpora into LLM context windows can eliminate the need for separate retrieval systems, and what task types this approach handles well or poorly.

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Do embedding dimensions fundamentally limit retrievable document combinations?

Can single-vector embeddings represent any top-k document subset a user might need? Research using communication complexity theory suggests there are hard geometric limits independent of training data or model architecture.

Encoder Architecture

1 note

Query Routing and Knowledge Structure Selection

1 note

Can routing queries to task-matched structures improve RAG reasoning?

Does matching retrieval structure type to task demands—tables for analysis, graphs for inference, algorithms for planning—improve reasoning accuracy over uniform chunk retrieval? This explores whether cognitive fit principles from human learning transfer to AI systems.

Pass 3 Additions (2026-05-03)

7 notes

Can pretraining data statistics detect hallucinations better than model confidence?

Explores whether checking if entity combinations appeared in the training data is a more reliable hallucination signal than the model's own confidence, especially for catching confidently wrong outputs.

Should RAG systems use model confidence or data rarity to trigger retrieval?

Internal uncertainty and pretraining-data rarity signals catch different failure modes in RAG. This explores whether one signal suffices or both are needed to prevent hallucination across different failure types.

How can video retrieval handle multiple modalities at different times?

Video RAG systems struggle because the same content appears across visual, audio, and subtitle tracks at offset timestamps. Can temporal awareness in text ranking and frame sampling solve cross-modal misalignment?

Can RAG systems refuse to answer without reliable evidence?

Explores whether retrieval-augmented generation can be designed to abstain from answering when sources are corrupted or insufficient, rather than filling gaps with plausible-sounding guesses. This matters for historical text where OCR errors and language drift are common.

Can we defend RAG systems from corpus poisoning without retraining?

Explores whether retrieval-time defenses can catch and block poisoned documents before they reach the generator, without expensive retraining cycles. This matters because corpus updates outpace model retraining in production RAG systems.

Why do queries and their causes seem semantically different?

Information retrieval systems find passages matching query language, but what if the segment that actually caused a user's question says something quite different? This explores when semantic similarity fails to find causal relevance.

How should LLM-based recommenders retrieve from massive item corpora?

When conversational recommenders need to search millions of items, the LLM cannot memorize the corpus. What retrieval strategies work best under different constraints, and how do they trade off latency, sample efficiency, and scalability?

Backlog wave 2: Batch #3 (2026-06-03)

1 note

Can question features alone predict when to retrieve?

Can lightweight external features of a question—rather than expensive model uncertainty checks—reliably decide whether retrieval is needed? This matters because uncertainty-based methods promise efficiency but add computation.

Backlog wave 2: Batch #3 (2026-06-03)

1 note

Can retrieval systems ground answers in the right time?

Explores whether document retrieval for language models can distinguish between multiple versions of the same content from different time points, and whether adding temporal awareness to retrieval scoring helps answer time-sensitive questions accurately.

Backlog wave 3: Batch #4 (2026-06-03)

2 notes

Does cosine similarity actually measure embedding similarity?

Cosine similarity is ubiquitous for comparing learned embeddings, but does it reliably capture semantic closeness? This work investigates whether regularization during training makes cosine scores arbitrary and unstable.
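
One way cosine scores can become arbitrary is sketched below, assuming a dot-product model such as matrix factorization (an assumption made here for illustration; the linked note may study a different setup). Rescaling each embedding dimension on the item side and applying the inverse scaling on the user side leaves every user-item dot product unchanged, so the model's predictions are identical, yet the cosine similarity between items shifts. All vectors and the scaling factors are made-up values.

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(x, y))


def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))


def rescale(vec, factors):
    """Multiply each dimension of vec by the matching factor."""
    return [v * f for v, f in zip(vec, factors)]


# Hypothetical learned embeddings.
item_a = [1.0, 2.0]
item_b = [2.0, 1.0]
user = [0.5, 1.5]

# Arbitrary per-dimension scaling and its inverse.
scale = [10.0, 0.1]
inverse = [1.0 / s for s in scale]

item_a_scaled = rescale(item_a, scale)
item_b_scaled = rescale(item_b, scale)
user_scaled = rescale(user, inverse)

# The model's prediction (user-item dot product) is the same either way...
print(dot(user, item_a), dot(user_scaled, item_a_scaled))
# ...but the item-item cosine similarity depends on the scaling chosen.
print(cosine(item_a, item_b), cosine(item_a_scaled, item_b_scaled))
```

Because both parameterizations fit the training objective equally well, which cosine value one observes depends on details such as the regularizer rather than on the data alone.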

Do retrieval models actually follow natural language instructions?

Most IR systems ignore instructions that define relevance, despite using LLM backbones. This raises questions about whether retrievers can adapt to nuanced user-specified information needs in practice.
