INQUIRING LINE

Can models retrieve the right tool without relying on vector similarity?

This explores whether an LLM can pick the correct tool to call without leaning on embedding/vector-similarity matching — and what the corpus offers as alternatives.


The corpus offers a surprisingly rich set of escape routes from embedding matching. The starting problem is sharp: vector embeddings measure *semantic association*, not *task relevance* (Do vector embeddings actually measure task relevance?). They encode co-occurrence, so concepts that are semantically close but play completely different roles look nearly identical. That's fine in a demo and quietly broken in production, where an underspecified query has many wrong-but-associated candidates that the embedding happily ranks high.

The most direct answer is to flip who's in charge. Instead of a retriever passively matching a query to tool descriptions, let the model itself emit structured tool requests and refine them across turns (Can models decide better than retrievers which tools to use?). This sidesteps the colloquial-to-formal vocabulary gap that sinks single-round semantic matching: the model reasons its way to the requirement rather than hoping its phrasing lands near the right embedding. A related move replaces similarity ranking with *reasoning about relevance*: generating rationales for why a piece of evidence matters beats similarity re-ranking by a third while using half as many chunks (Can rationale-driven selection beat similarity re-ranking for evidence?). The lesson generalizes from evidence selection to tool selection: "why is this relevant" is a different and better question than "what looks similar."
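A minimal sketch of that model-in-charge loop, under loud assumptions: the registry, the `server` and `capabilities` fields, and `stub_model` are all hypothetical stand-ins invented for illustration, and the exact-match lookup is a simplification rather than the matching scheme of any particular system. The point it tries to show is only the shape of the interaction: the model emits a structured request, the registry answers deterministically, and ambiguity flows back to the model as feedback for the next turn.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    server: str               # hypothetical grouping, e.g. "filesystem"
    capabilities: frozenset   # hypothetical controlled vocabulary, not free text


# Hypothetical tool registry used only for this illustration.
REGISTRY = [
    Tool("read_file", "filesystem", frozenset({"read", "file"})),
    Tool("delete_file", "filesystem", frozenset({"delete", "file"})),
    Tool("send_email", "mail", frozenset({"send", "message"})),
]


def lookup(request: dict) -> list[Tool]:
    """Deterministic match: same server and every requested capability present."""
    wanted = set(request["capabilities"])
    return [
        tool for tool in REGISTRY
        if tool.server == request["server"] and wanted <= tool.capabilities
    ]


def stub_model(task: str, feedback: str | None) -> dict:
    """Stand-in for an LLM call (assumption): vague first, sharper after feedback."""
    if feedback is None:
        return {"server": "filesystem", "capabilities": ["file"]}
    return {"server": "filesystem", "capabilities": ["delete", "file"]}


def select_tool(task: str, max_turns: int = 3) -> Tool | None:
    """Ask the model for a structured request until exactly one tool matches."""
    feedback = None
    for _ in range(max_turns):
        request = stub_model(task, feedback)
        matches = lookup(request)
        if len(matches) == 1:
            return matches[0]
        # Ambiguous or empty result: tell the model what happened and retry.
        feedback = f"{len(matches)} tools matched {request}; refine the request"
    return None
```

Calling `select_tool("get rid of the stale log")` takes two turns here: the first request matches both filesystem tools, and the refined request narrows to `delete_file`. In a real system the stub would be replaced by an actual model call and the feedback string by whatever the tool server reports.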

There are also structural alternatives that drop similarity search entirely. When the relationships between things are what matters, deterministic graph traversal beats probabilistic vector lookup: you query the structure with something like Cypher instead of guessing at nearest neighbors (When do graph databases outperform vector embeddings for retrieval?). And the model's own uncertainty turns out to be a better signal than external retrieval heuristics for *whether* to reach for a tool at all: calibrated token-probability uncertainty beats multi-call adaptive retrieval on single-hop tasks and matches it on multi-hop ones, at a fraction of the compute (Can simple uncertainty estimates beat complex adaptive retrieval?). The model's self-knowledge is doing the routing.
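The uncertainty-as-router idea can be sketched in a few lines. This is an illustration under assumptions, not the estimator from the cited note: it uses the mean negative log-probability of the generated tokens as the uncertainty score, and the function names and the `threshold` value of 0.5 are hypothetical. Calibration (tuning the threshold on held-out data) is assumed to happen elsewhere.

```python
import math
from typing import Sequence


def mean_neg_logprob(token_probs: Sequence[float]) -> float:
    """Average negative log-probability of generated tokens (higher = less sure).

    Assumes every probability lies in (0, 1].
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def should_use_tool(token_probs: Sequence[float], threshold: float = 0.5) -> bool:
    """Route to a tool or retriever only when the draft answer looks uncertain.

    The threshold is a hypothetical value; in practice it would be tuned.
    """
    return mean_neg_logprob(token_probs) > threshold


confident_draft = [0.95, 0.90, 0.97]   # made-up per-token probabilities
hesitant_draft = [0.40, 0.30, 0.50]    # made-up per-token probabilities

print(should_use_tool(confident_draft))  # False
print(should_use_tool(hesitant_draft))   # True
```

The design choice worth noticing is that the gate costs one forward pass the model was already making; no extra retriever or LM calls are needed to decide whether to reach outward.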

What you didn't ask but might want to know: similarity itself isn't the villain, but *learned* similarity can be. A properly tuned dot product beats an MLP trained to imitate one, because the dot product carries a structural inductive bias the MLP has to discover from scratch (Why does dot product beat MLP-based similarity in practice?). So the real fault line isn't "vector vs. not-vector"; it's whether your matching mechanism encodes the right notion of relevance. Two more threads round this out. Small models can be trained to call functions reliably through preference pairs that teach them what a *wrong* call looks like, not just what's plausible (Can small models match large models on function calling?). And models can learn to operate over an inventory they never directly retrieve over, picking the right action through closed-loop feedback rather than lookup (Can LLMs recommend products without ever seeing the catalog?). Across all of these, the through-line is the same: relevance is a reasoning and feedback problem, and vector similarity is only one way to approximate it, and often the weakest.
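To make the preference-pair idea concrete, here is a small sketch of the standard per-pair DPO objective, the method named in the function-calling source note. The log-probability values and the `beta` default are made-up illustrations; the formula itself is the usual one, a negative log-sigmoid of the scaled difference between how much the policy and a frozen reference model each prefer the correct call over the wrong one.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative value, not taken from the source
) -> float:
    """Per-pair DPO loss: -log(sigmoid(margin)).

    margin = beta * [(policy - ref) on the chosen call
                     - (policy - ref) on the rejected call]
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Hypothetical pair: chosen = a well-formed call, rejected = a malformed one.
# Sequence log-probabilities below are invented for illustration.
untrained = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # policy == reference
improved = dpo_loss(-4.0, -9.0, -5.0, -5.0)    # policy now prefers the chosen call
print(untrained > improved)  # True
```

When the policy matches the reference, the loss sits at log 2; it falls as the policy raises the correct call and lowers the wrong one relative to the reference. That explicit push away from the rejected example is what the note credits for fixing rigid output-format failures.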


Sources (8 notes)

Do vector embeddings actually measure task relevance?

Embeddings encode co-occurrence patterns, making semantically close but role-distinct concepts highly similar. This works in simple demos but fails in production where underspecified queries have many wrong-but-associated candidates.

Can models decide better than retrievers which tools to use?

MCP-Zero shows that letting models emit structured tool requests iteratively across conversations outperforms single-round semantic matching. The model can refine requirements progressively across domains as reasoning unfolds, bypassing colloquial-to-formal vocabulary mismatch.

Can rationale-driven selection beat similarity re-ranking for evidence?

METEORA uses LLM-generated rationales with flagging instructions to select evidence, achieving 33% better accuracy with 50% fewer chunks than similarity re-ranking across legal, financial, and academic domains. The method also improves adversarial robustness substantially.

When do graph databases outperform vector embeddings for retrieval?

Graph-oriented databases solve vector similarity's failure on aggregate queries by replacing probabilistic similarity search with deterministic graph traversal via Cypher. The tradeoff: higher construction cost but precision and completeness for enterprise use cases where query patterns are relational.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.

Why does dot product beat MLP-based similarity in practice?

Rendle et al. show properly-tuned dot products substantially beat MLP-based similarity despite MLP universality. Learning a dot product with an MLP requires large models and datasets; dot products also enable efficient retrieval at production scale through MIPS algorithms.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can LLMs recommend products without ever seeing the catalog?

Rec-R1 experiments show that LLMs trained via RL with recommender metrics as rewards can generate effective product search queries without catalog access. The model learns query refinement indirectly through system feedback, paralleling how humans search without knowing platform inventory.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **Can models retrieve the right tool without relying on vector similarity?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2017–2025. Core constraints the library identified:
- Vector embeddings encode semantic association, not task relevance, causing production failures with underspecified queries (~2024).
- Single-round similarity matching loses to rationale-driven selection by ~33% while using half the chunks (~2024).
- Learned similarity (MLPs) underperforms a structural dot product because it lacks the dot product's inductive bias (~2024).
- Small models trained via DPO can match large models on function-calling reasoning (~2024).
- Model uncertainty beats heuristic-based adaptive retrieval at lower compute cost (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2508.21038 (Aug 2025) — theoretical limits of embedding-based retrieval
- arXiv:2506.01056 (Jun 2025) — proactive toolchain construction without retrieval
- arXiv:2410.18890 (Oct 2024) — small-model function-calling via DPO
- arXiv:2501.12835 (Jan 2025) — uncertainty-driven adaptive retrieval

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each finding above, determine whether newer models (GPT-4o, Claude 3.5, o1-class reasoners), training methods (test-time compute scaling, synthetic preference data), tool-calling infrastructure (OpenAI/Anthropic native APIs, MCP improvements), or evaluation suites have since relaxed or overturned the constraint. Separate what's durable (the reasoning problem) from what's resolved (the mechanism). Flag where similarity still fails and what replaced it.
(2) **Surface the strongest contradicting or superseding work from the last ~6 months.** Has embedding-based retrieval made a surprising comeback? Are there new hybrid architectures?
(3) **Propose 2 research questions that assume the regime may have shifted.** E.g., does native model function-calling now obsolete learned tool selection? Can test-time search over tool spaces replace ranking entirely?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
