INQUIRING LINE

How do time-based and entity-based queries differ from semantic similarity retrieval?

This explores why some questions — "what did we talk about Tuesday?" (time-based) or "how is this company connected to that one?" (entity-based) — can't be answered by finding text that *sounds* similar, and what the corpus says about the machinery each kind of query actually needs.


Two kinds of questions, those anchored to *when* something happened and those anchored to *which entities relate to which*, break the usual retrieval trick of fetching the passages that read most alike. Semantic similarity search works by turning text into vectors and returning whatever sits closest in that space. The recurring finding across the corpus is that closeness in meaning is the wrong axis for both temporal and relational questions, and not by a margin you can tune away: it is a structural mismatch.

The cleanest case is time. A question like "what did we discuss Tuesday?" has almost no semantic content to match against; the answer is defined by a timestamp, not a topic. The note "Why do time-based queries fail in conversational retrieval systems?" frames this as a challenge that static databases never pose: you need metadata indexing and, for references like "tell me more about that," a disambiguation step that resolves *what* is being referred to before you can retrieve at all. Embeddings have no native handle on "last Tuesday."
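A minimal sketch of the metadata route, assuming each message is stored with a timestamp. The `Message` class, the `most_recent_weekday` helper, and the sample log are hypothetical, invented here for illustration; none of them come from the cited notes. The lookup never compares message text, which is the point.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta


@dataclass
class Message:
    sent_at: datetime  # metadata stored alongside the text
    text: str


def most_recent_weekday(today: date, weekday: int) -> date:
    """Latest date strictly before `today` that falls on `weekday` (Monday=0)."""
    days_back = (today.weekday() - weekday) % 7 or 7
    return today - timedelta(days=days_back)


def messages_on(log: list[Message], day: date) -> list[Message]:
    """Filter on timestamp metadata only; message text is never inspected."""
    return [m for m in log if m.sent_at.date() == day]


# Hypothetical conversation log (illustrative data only).
log = [
    Message(datetime(2024, 6, 3, 9, 0), "Kickoff notes for the migration"),
    Message(datetime(2024, 6, 4, 14, 30), "Budget approval is still pending"),
    Message(datetime(2024, 6, 4, 16, 0), "Vendor call moved to Friday"),
    Message(datetime(2024, 6, 5, 11, 15), "Migration dry run succeeded"),
]

today = date(2024, 6, 6)                  # a Thursday
tuesday = most_recent_weekday(today, 1)   # Tuesday is weekday 1
tuesday_messages = messages_on(log, tuesday)
```

The two Tuesday messages share no topic with each other or with the question, so a similarity ranking would have no reason to group them; the date filter does so trivially.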

Entity-based queries fail for a different reason: the answer lives in *connections between things*, not in any single passage. The note "When do graph databases outperform vector embeddings for retrieval?" shows graph traversal beating vector similarity on aggregate and multi-hop relational questions, trading higher build cost for precision and completeness. Walking explicit edges (this supplier ships to that plant) is deterministic, while similarity search only gives you a probabilistic cloud of associated text. "Can long-context LLMs replace retrieval-augmented generation systems?" sharpens the line: stuffing everything into a long context can match RAG on semantic retrieval, yet still can't execute relational queries that require joins across structured records. Context length doesn't buy you structure.
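To make "walking explicit edges" concrete, here is a small sketch using a breadth-first search over a hand-written edge list. The company names, relation labels, and the `find_path` helper are all hypothetical; a production system would more likely issue a query to a graph database than traverse a Python dictionary.

```python
from collections import deque

# Hypothetical directed edges: (source, relation, target).
EDGES = [
    ("Acme Metals", "ships_to", "Plant A"),
    ("Plant A", "operated_by", "Borealis Corp"),
    ("Borealis Corp", "subsidiary_of", "Northwind Holdings"),
    ("Acme Metals", "audited_by", "Ledger LLP"),
]


def build_adjacency(edges):
    """Index edges by source node for constant-time neighbor lookup."""
    adjacency = {}
    for source, relation, target in edges:
        adjacency.setdefault(source, []).append((relation, target))
    return adjacency


def find_path(adjacency, start, goal):
    """Breadth-first search; returns the hops as (source, relation, target)
    triples, or None when no directed path exists."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


adjacency = build_adjacency(EDGES)
path = find_path(adjacency, "Acme Metals", "Northwind Holdings")
no_path = find_path(adjacency, "Ledger LLP", "Plant A")
```

The same inputs always produce the same path, and a missing connection comes back as an explicit `None` rather than as a weakly related passage.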

The deeper reason is that embeddings measure the wrong thing in the first place. The note "Do vector embeddings actually measure task relevance?" argues they encode co-occurrence (what tends to appear near what), so role-distinct concepts come back as near-twins. "Why do queries and their causes seem semantically different?" makes the same point from another angle: when a student asks about "projection" after hearing a specific statement, the semantically closest passage (on projection matrices) is the wrong one, because the *cause* of the question lies somewhere else entirely. Time and entity queries are just the most visible places where "sounds similar" and "is actually the answer" come apart.
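A toy illustration of the co-occurrence point, under strong assumptions: it uses raw sentence-level co-occurrence counts instead of a learned embedding model, and the five-sentence corpus is invented. It is only meant to show how two words with opposite roles ("buyer" and "seller") can end up with nearly identical vectors because they keep the same company.

```python
import math
from collections import Counter, defaultdict

# Hypothetical mini-corpus (illustrative data only).
SENTENCES = [
    "the buyer signed the contract",
    "the seller signed the contract",
    "the buyer paid the invoice",
    "the seller issued the invoice",
    "the plant shipped steel",
]


def cooccurrence_vectors(sentences):
    """Map each word to counts of the other tokens sharing its sentence."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[word][context] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


vectors = cooccurrence_vectors(SENTENCES)
buyer_vs_seller = cosine(vectors["buyer"], vectors["seller"])
buyer_vs_steel = cosine(vectors["buyer"], vectors["steel"])
```

On this corpus `buyer_vs_seller` is much higher than `buyer_vs_steel`, even though a buyer and a seller play opposite roles in every transaction the sentences describe.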

What's worth taking away is that the field is converging on a portfolio view, not a winner. "Where do retrieval systems fail and why?" catalogs these as distinct structural failure levels rather than one knob to tune, and "Can query-time graph construction replace pre-built knowledge graphs?" (LogicRAG) hints at the synthesis: build relational structure *from the query at inference time* so you get graph-style logic without a pre-built graph. The interesting move isn't choosing semantic vs. structured retrieval; it's routing each question to the representation that matches how its answer is actually organized.
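The routing idea can be sketched as a tiny dispatcher. The keyword patterns and the backend labels (`metadata_index`, `graph_traversal`, `vector_search`) are assumptions made up for this example; a real router would more plausibly be a trained classifier or an LLM call, and nothing here reflects how LogicRAG or any cited system works.

```python
import re

# Hypothetical cue patterns; deliberately crude and far from exhaustive.
TEMPORAL_CUES = re.compile(
    r"\b(yesterday|today|last week|monday|tuesday|wednesday|thursday|"
    r"friday|saturday|sunday)\b",
    re.IGNORECASE,
)
RELATIONAL_CUES = re.compile(
    r"\b(connected to|related to|relationship between|linked to)\b",
    re.IGNORECASE,
)


def route(question: str) -> str:
    """Pick the retrieval backend whose structure matches the question."""
    if TEMPORAL_CUES.search(question):
        return "metadata_index"    # answer is defined by a timestamp
    if RELATIONAL_CUES.search(question):
        return "graph_traversal"   # answer is a path between entities
    return "vector_search"         # default: topical similarity


routes = {
    q: route(q)
    for q in [
        "What did we talk about Tuesday?",
        "How is this company connected to that one?",
        "Explain projection matrices",
    ]
}
```

The design choice worth noting is that the semantic path is the fallback, not the default assumption: similarity search is used only when the question shows no sign of being organized by time or by relationships.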


Sources (7 notes)

Why do time-based queries fail in conversational retrieval systems?

Conversational memory faces two distinct retrieval challenges absent from static databases: time-based queries ("what did we discuss Tuesday?") requiring metadata indexing, and ambiguous references ("tell me more about that") requiring contextual disambiguation before retrieval.

When do graph databases outperform vector embeddings for retrieval?

Graph-oriented databases solve vector similarity's failure on aggregate queries by replacing probabilistic similarity search with deterministic graph traversal via Cypher. The tradeoff: higher construction cost but precision and completeness for enterprise use cases where query patterns are relational.

Can long-context LLMs replace retrieval-augmented generation systems?

The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Why do queries and their causes seem semantically different?

Backtracing—finding what caused a query—diverges from semantic similarity especially in conversation and lecture domains. Students ask about projection after hearing a specific statement, but the semantically closest passage discusses projection matrices instead, showing that surface similarity misses the actual cause.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Can query-time graph construction replace pre-built knowledge graphs?

LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a retrieval systems researcher evaluating whether the failures of semantic similarity retrieval on time-based and entity-based queries are structural or contingent limitations. The question: do these three retrieval modes (semantic similarity, temporal indexing, relational graph traversal) represent genuinely different information-retrieval regimes, or has capability progress since mid-2024 begun to unify them?

What a curated library found—and when (findings span 2023–2026, but treat as dated claims to re-test):
• Semantic embeddings fail on temporal queries because they encode co-occurrence, not timestamps; metadata indexing required separately (2024-05, arXiv:2406.00057).
• Graph traversal outperforms vector similarity on multi-hop relational queries; "walking explicit edges" is deterministic while embeddings yield probabilistic clouds (inferred from 2024 path).
• Long-context LLMs can match RAG on semantic retrieval but still fail on relational queries requiring joins across structured records; context length alone does not buy structure (2024-06, arXiv:2406.13121).
• Embeddings measure semantic association (co-occurrence), not task relevance or causal relevance; the "wrong thing" is being optimized (2024-03, arXiv:2403.03956; 2025-08, arXiv:2508.21038).
• Inference-time query-to-graph construction hints at synthesis: derive relational structure from queries rather than pre-build graphs (2025-08, arXiv:2508.06105).

Anchor papers (verify; mind their dates):
- arXiv:2406.00057 (2024-05): Conversational memory and time-sensitive retrieval
- arXiv:2406.13121 (2024-06): Long-context subsumption of RAG and SQL
- arXiv:2403.03956 (2024-03): Backtracing causal relevance
- arXiv:2508.06105 (2025-08): Adaptive reasoning without pre-built graphs

Your task:
(1) RE-TEST each regime boundary. For temporal queries: have retrieval-augmented LLMs (Chain-of-Retrieval, RAG-R1) or improved memory architectures since 2025-01 begun to fold timestamp handling into learned retrieval routing, or does metadata indexing remain separable? For relational queries: do recent multi-query retrieval or compositional sensitivity training (arXiv:2604.16351, 2026-03) narrow the gap between graph traversal and embedding-based join-like behavior, or is the structural gap still irreducible?
(2) Surface the strongest work from the last ~6 months that either *contradicts* the regime separation (e.g., a unified retrieval model that handles all three), *supersedes* the need for graphs (e.g., learned routing that recovers relational reasoning from embeddings alone), or *complicates* the causal-vs.-semantic distinction (e.g., findings that causality is learnable or that co-occurrence is sufficient under different training regimes).
(3) Propose two research questions that assume the regime may have merged: (a) Can inference-time graph construction now be done efficiently enough to unify time, entity, and semantic retrieval under one learned router? (b) Do recent advances in embedding training (compositional sensitivity, concept frequency) allow a *single* vector space to encode temporal order, relational structure, and semantic similarity simultaneously?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
