Can direct corpus search beat embedding-based retrieval?
Explore whether agents that issue shell commands over raw text can outperform dense retrieval systems, especially when queries demand exact entity matching and symbolic precision across multiple reasoning steps.
The standard search-agent stack assumes a retriever: a query goes in, a precomputed index returns a ranked list, and the agent reasons over the results. GrepSeek inverts the substrate: the corpus itself is the environment, and the agent finds evidence by issuing executable shell commands (grep and friends) over raw text. The motivation is a specific failure of dense retrieval. Embedding-based models semantically conflate distinct entities, so multi-hop queries that hinge on exact symbolic patterns or strict entity-level constraints retrieve the wrong documents. Direct corpus interaction (DCI) recovers the lexical precision that dense vectors blur. As a side effect, it eliminates the expensive offline indexing stage and reduces runtime memory via sharded-parallel execution.
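To make the idea of sharded-parallel exact matching concrete, here is a minimal sketch in Python. It is not GrepSeek's implementation: the toy corpus, the shard layout, and the function names `grep_shard` and `grep_corpus` are all hypothetical, and a regex scan over in-memory lines stands in for a real grep process over shard files.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy corpus, already split into shards (lists of lines).
# In a real setup each shard would be a file on disk searched by grep.
SHARDS = [
    [
        "Michael Jordan played basketball for the Chicago Bulls.",
        "Michael I. Jordan is a professor of machine learning at UC Berkeley.",
    ],
    [
        "Michael B. Jordan starred in the film Creed.",
        "The Chicago Bulls won six NBA championships in the 1990s.",
    ],
]


def grep_shard(shard_id, lines, pattern):
    """Return (shard_id, line_no, line) for every line matching the regex."""
    rx = re.compile(pattern)
    return [(shard_id, i, line) for i, line in enumerate(lines) if rx.search(line)]


def grep_corpus(pattern, shards=SHARDS, workers=4):
    """Run one pattern over all shards in parallel and merge the hits in shard order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_shard = pool.map(
            lambda item: grep_shard(item[0], item[1], pattern),
            enumerate(shards),
        )
        return [hit for hits in per_shard for hit in hits]


if __name__ == "__main__":
    # An exact pattern separates entities that share most of their surface form.
    for hit in grep_corpus(r"Michael I\. Jordan"):
        print(hit)
```

The exact pattern `Michael I\. Jordan` matches only the line about the professor, while the looser pattern `Jordan` matches all three people. That is the kind of entity-level constraint the note says dense vectors tend to blur.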
The interesting part is the training, because RL directly on a large corpus is unstable: the action space of shell commands over raw text is enormous and rewards are sparse. GrepSeek uses a two-stage pipeline. First, an answer-aware Tutor and an answer-blind Planner generate verified, causally grounded cold-start trajectories for SFT. Then GRPO refines task-oriented search behavior through direct interaction. This is the same architectural move as in "Can externalized bookkeeping let smaller search agents beat larger ones?", applied to the retrieval primitive: scaffold the unstable parts so RL only has to learn the genuinely strategic search behavior.
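One way to see why sparse rewards make a cold start matter is to look at the group-relative advantage that GRPO is commonly described as using: rewards for a group of rollouts on the same question are normalized by the group mean and standard deviation. The sketch below assumes that standard formulation. The binary rewards are hypothetical, and the note does not state the reward GrepSeek actually uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group.

    Assumes the commonly described GRPO form (r - mean) / std. Population
    standard deviation is an assumption here; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical group of four rollouts where one found the answer.
    print(group_relative_advantages([1.0, 0.0, 0.0, 0.0]))
    # Hypothetical group where every rollout failed: all advantages are zero.
    print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))
```

When every rollout in a group fails, all advantages are zero and the group contributes no learning signal. That is consistent with the note's point that SFT on verified trajectories comes first, so that refinement with GRPO starts from a policy that succeeds at least some of the time.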
It would be wrong to read this as "grep beats embeddings." The win is regime-specific: it holds for symbolic, entity-constrained, lexically exact needs. Dense retrieval still dominates paraphrase and conceptual matching, where the query and the answer share meaning but no surface tokens; an exact-match agent is blind exactly there. So GrepSeek is best understood as restoring a lexical-precision tool that dense-only pipelines discarded, a complementary perspective rather than a replacement. It also complicates the convenience story behind "Why do search agents fail users despite strong benchmark scores?": interpretable, inspectable retrieval programs are auditable in a way a dense top-k list is not, but they shift the burden of expressing the need onto precise command construction, which the model must learn to author.
Inquiring lines that use this note as a source (6)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How does parametric knowledge sabotage context-grounded question answering?
- Does tail distribution collapse in training predict retrieval failure patterns?
- Does selective history retrieval outperform full context inclusion in agent reasoning?
- Why do dense embeddings semantically conflate distinct entities in retrieval?
- When should interpretable search programs replace ranked dense retrieval?
- What paraphrase and conceptual matching tasks favor dense over exact-match retrieval?
Related concepts in this collection (3)
- Can externalized bookkeeping let smaller search agents beat larger ones?
  Does offloading routine record-keeping to an environment harness free RL policies to focus on semantic search decisions, and can this approach outperform larger searchers with fewer parameters?
  convergent-with: scaffold the unstable mechanics so RL optimizes only the strategic search decisions
- Do hierarchical retrieval architectures outperform flat ones on complex queries?
  Explores whether separating query planning from answer synthesis into distinct architectural components improves performance on multi-hop retrieval tasks compared to unified single-pass approaches.
  extends: another multi-hop retrieval win, here via lexical-exact corpus interaction rather than planner/synthesizer split
- Why do search agents fail users despite strong benchmark scores?
  Search evaluation benchmarks show high performance, yet real users remain unsatisfied. What gaps between test conditions and actual search behavior explain this disconnect?
  grounds: GrepSeek's interpretable retrieval programs are auditable but shift burden to precise command authoring
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- GrepSeek: Training Search Agents for Direct Corpus Interaction
- A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts
- ZeroSearch: Incentivize the Search Capability of LLMs without Searching
- UR2: Unify RAG and Reasoning through Reinforcement Learning
- Precise Zero-Shot Dense Retrieval without Relevance Labels
- Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
- Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
Original note title
treating the corpus as the search environment lets a grep-issuing agent beat dense retrieval where embeddings semantically conflate distinct entities