INQUIRING LINE

How do knowledge graphs scale as training data for open-ended search tasks?

This explores whether knowledge graphs are a good source of synthetic training data for agents that do open-ended, multi-hop search — and how well that approach holds up as you scale it.


The most direct answer in the corpus is yes, and the trick is making the questions genuinely hard. Random walks across a knowledge graph naturally generate multi-hop questions with verifiable answers, but if entities are named plainly the questions are too easy to look up. Selectively blurring entities forces an agent to actually reason and search across hops, and this is what lets a 32B model trained on synthetic graph data beat much larger models on hard browsing benchmarks (Can knowledge graphs generate training data for search agents?). The scaling story here isn't 'more data'; it's that graph structure lets you manufacture difficulty on demand, cheaply and with built-in answer checking.
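A minimal sketch of the random-walk-plus-blurring idea, assuming a toy triple store. The graph, the entity names, and the helper names (`build_adjacency`, `random_walk`, `make_question`) are invented for illustration and are not DeepDive's pipeline. Blurring here simply replaces each intermediate entity with a description of how it was reached, and the start entity stays named.

```python
import random

# Toy knowledge graph as (head, relation, tail) triples.
# Every entity and relation here is made up for illustration.
TRIPLES = [
    ("Ada", "born_in", "Lyon"),
    ("Lyon", "located_in", "France"),
    ("France", "capital", "Paris"),
    ("Ada", "works_at", "Institut X"),
    ("Institut X", "located_in", "Paris"),
]


def build_adjacency(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    adj = {}
    for head, relation, tail in triples:
        adj.setdefault(head, []).append((relation, tail))
    return adj


def random_walk(adj, start, hops, rng):
    """Walk up to `hops` edges from `start` without revisiting a node."""
    path, node, visited = [], start, {start}
    for _ in range(hops):
        options = [(r, t) for r, t in adj.get(node, []) if t not in visited]
        if not options:
            break  # dead end: return the shorter walk
        relation, tail = rng.choice(options)
        path.append((node, relation, tail))
        visited.add(tail)
        node = tail
    return path


def make_question(path, blur):
    """Turn a walk into (question, answer); the last tail is the answer.

    blur=False names the last head entity outright (a one-hop lookup).
    blur=True hides intermediate entities behind nested descriptions,
    so every hop has to be resolved to reach the answer.
    """
    answer = path[-1][2]
    if not blur:
        head, relation, _ = path[-1]
        return f"What is the {relation} of {head}?", answer
    description = path[0][0]
    for _, relation, _ in path[:-1]:
        description = f"the {relation} of ({description})"
    return f"What is the {path[-1][1]} of {description}?", answer
```

Calling `make_question(path, blur=False)` on a three-hop walk yields a question that only needs the final hop, while `blur=True` yields one whose answer is the same but whose intermediate entities never appear in the text. The answer is known by construction in both cases, which is what makes the data checkable.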

That last point, verifiable answers, is why graphs pair so well with reinforcement learning: an answer that can be checked automatically doubles as a reward signal. The deeper pattern across the corpus is that structured knowledge consistently beats raw text volume. A medical knowledge-graph curriculum of reasoning tasks produces domain expertise that scale alone doesn't (Can knowledge graphs teach models deep domain expertise?), and organizing training chunks into a taxonomy reaches half of full-corpus performance using a fraction of a percent of the data (Can organizing knowledge structures beat raw training data volume?). The reason is that the model learns where a fact sits in a conceptual structure rather than memorizing surface patterns, closer to how a student learns from a textbook than from flashcards.
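The point about verifiable answers can be made concrete with a small reward function. This is a sketch under assumptions: the corpus notes do not say how answers are scored, so the normalization below (lowercasing, stripping punctuation and English articles, the style commonly used for exact-match QA metrics) and the names `normalize` and `exact_match_reward` are illustrative choices, not a documented method.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match_reward(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings match, else 0.0.

    Because a graph walk fixes the gold answer in advance, this check
    needs no human grader and can be used directly as an RL reward.
    """
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```

A binary reward like this is coarse, and real systems may prefer softer or model-graded matching; the sketch only shows why a known gold answer makes the reward cheap to compute.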

But 'scale' cuts two ways, and the corpus is interesting on the cost of the graphs themselves. Pre-building a corpus-wide knowledge graph is expensive, and the graph goes stale; one line of work instead builds small query-specific logic graphs at inference time, keeping the multi-hop reasoning while dropping the construction overhead (Can query-time graph construction replace pre-built knowledge graphs?). And once a graph is large, you can't read all of it, so learned traversal policies using tree search and RL let an agent walk the graph selectively within a context window, trading certainty about the whole graph for tractable navigation (Can learned traversal policies beat exhaustive graph reading?). Symbolic rules pulled from graph topology can also serve as navigation plans that align plain-language questions with the graph's actual structure (Can symbolic rules from knowledge graphs guide complex reasoning?).
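To illustrate the trade between tractable navigation and certainty about the whole graph, here is a deliberately simplified sketch. It is not Monte Carlo Tree Search and involves no RL: it is a plain best-first traversal with a fixed expansion budget, where the hypothetical `score` callable stands in for whatever value estimate a learned policy would provide.

```python
import heapq


def budgeted_traverse(adj, start, score, max_expansions):
    """Best-first traversal that expands at most `max_expansions` nodes.

    adj:   dict mapping a node to a list of neighbour nodes
    score: callable node -> float; a stand-in for a learned policy
    Returns the nodes in the order they were expanded.
    """
    # heapq is a min-heap, so scores are negated to pop the best first.
    frontier = [(-score(start), start)]
    seen = {start}
    expanded = []
    while frontier and len(expanded) < max_expansions:
        _, node = heapq.heappop(frontier)
        expanded.append(node)
        for neighbour in adj.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                heapq.heappush(frontier, (-score(neighbour), neighbour))
    return expanded
```

With a small budget the traversal commits to the branch that looks best early on and can miss a high-value node hidden behind a low-scoring neighbour. That is the uncertainty the paragraph above describes: the agent stays within its context budget but never sees the rest of the graph.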

The part you might not expect: graphs aren't always the right structure, and search itself behaves like a scaling axis. Routing each query to the knowledge structure that fits it (sometimes a graph, sometimes a table or a plain catalogue) beats forcing everything through graphs uniformly (Can routing queries to task-matched structures improve RAG reasoning?). And for the open-ended search task itself, the number of search iterations an agent spends shows the same diminishing-returns curve as reasoning tokens, meaning search budget is a tunable inference-compute dial, not just a fixed retrieval step (Does search budget scale like reasoning tokens for answer quality?). If you want the full loop of training search agents cheaply, the corpus also has work on simulating the search engine entirely from an LLM's internal knowledge to avoid API costs during RL (Can LLMs replace search engines during agent training?). That work is a natural companion to graph-generated questions: the graph supplies questions with known answers, and the simulator supplies search results, without paying for either.
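The diminishing-returns shape can be illustrated with a toy model. This is an assumption made purely for illustration, not the measured curve from the cited work: suppose each search iteration independently surfaces the needed evidence with probability `p_hit`, so the chance of having found it after k iterations is 1 - (1 - p_hit)^k.

```python
def expected_quality(p_hit: float, iterations: int) -> float:
    """Toy model: probability that at least one of `iterations`
    independent searches finds the needed evidence."""
    return 1.0 - (1.0 - p_hit) ** iterations


def marginal_gains(p_hit: float, max_iterations: int) -> list:
    """Quality added by each successive iteration, from 1 to max_iterations."""
    return [
        expected_quality(p_hit, k) - expected_quality(p_hit, k - 1)
        for k in range(1, max_iterations + 1)
    ]
```

Under this model quality rises with every extra iteration while each iteration adds less than the one before it, which is the qualitative behaviour that makes search budget act like a dial. Real agents will not have independent, equally useful searches, so the numbers carry no weight beyond showing the shape.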


Sources (9 notes)

Can knowledge graphs generate training data for search agents?

KG-based random walks with selective entity obscuring create verifiable, multi-hop questions that train deep search agents effectively. DeepDive-32B trained on this data achieves 14.8% on BrowseComp, outperforming larger models through end-to-end multi-turn RL.

Can knowledge graphs teach models deep domain expertise?

Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.

Can organizing knowledge structures beat raw training data volume?

StructTuning achieves 50% of full-corpus performance using only 0.3% of training data by organizing chunks into auto-generated domain taxonomies. The model learns knowledge position within conceptual structures rather than raw text patterns, matching how students learn from textbooks.

Can query-time graph construction replace pre-built knowledge graphs?

LogicRAG constructs directed acyclic graphs from queries at inference time rather than pre-building corpus-wide graphs, eliminating construction overhead, avoiding staleness, and enabling query-specific retrieval logic without sacrificing multi-hop reasoning capability.

Can learned traversal policies beat exhaustive graph reading?

Graph-O1 replaces whole-graph ingestion with step-by-step agentic navigation using Monte Carlo Tree Search and reinforcement learning. This approach fits within LLM context windows while learning domain-specific traversal policies, though it trades certainty about the full graph for decision-making under uncertainty.

Can symbolic rules from knowledge graphs guide complex reasoning?

SymAgent derives symbolic rules from KG structure using LLM reasoning to create navigational plans that align natural language with graph topology. This approach captures structural reasoning patterns explicitly, outperforming retrieval methods that rely on semantic similarity alone.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

Can LLMs replace search engines during agent training?

ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI research analyst. The question remains open: **How do knowledge graphs scale as training data for open-ended search tasks?** Treat the findings below as dated claims (spanning 2024–2025), not current truth. Re-test them.

What a curated library found, and when (dated claims, not timeless truth):
• Entity-blurring on graph random walks forces genuine multi-hop reasoning; a 32B model trained on synthetic graph data outperforms much larger models on hard benchmarks (2024–2025).
• Structured knowledge (graph curricula, taxonomic organization) beats raw text volume; one approach achieves 50% of full-corpus performance with <1% of data (2024).
• Inference-time query-specific logic graphs avoid pre-build costs and staleness; learned traversal policies + tree search + RL enable selective graph navigation within context windows (~2025).
• Routing queries to task-appropriate structures (graph, table, catalogue) beats forcing all queries through graphs uniformly (2024–2025).
• Search budget exhibits test-time scaling laws similar to reasoning tokens; agents can simulate search engines from internal knowledge, eliminating API costs during RL (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2407.16724 (2024-07): *Educating LLMs like Human Students: Structure-aware Injection of Domain Knowledge*
• arXiv:2410.08815 (2024-10): *StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information*
• arXiv:2506.18959 (2025-06): *From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents*
• arXiv:2509.10446 (2025-09): *DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL*

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, determine whether newer models, RL methods, orchestration (memory, multi-agent, caching), or evaluation harnesses have since RELAXED or OVERTURNED it. Separate the durable question ("Does structure beat scale in training?") from perishable limitations ("graphs must be pre-built," "all queries need graphs"). Cite what resolved each constraint; flag what still holds.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last 6 months.** Look for papers that question whether graphs are necessary, whether synthetic graph data generalizes to real open-ended search, or whether simpler retrieval structures compete equally. Note disagreements on when routing beats uniform application.
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., "Do graph-trained agents transfer to search tasks outside their domain graph?" or "At what model scale does the structured-vs.-raw advantage invert?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
