SYNTHESIS NOTE

Can we generate synthetic data without any seed examples?

Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?

Synthesis note · 2026-05-03 · sourced from Data

Existing synthetic data generation methods generally fall into two categories: prompt-engineered approaches, which generalize poorly because they require manual customization for each task, and stochastic evolutionary algorithms, which lack explainability and control. Both typically require seed examples drawn from the target distribution, which is unrealistic for genuinely novel domains and may hurt global coverage by anchoring generation to existing examples.

Simula proposes a different decomposition: separate global coverage from local diversity, and address each through its own mechanism. For global coverage, the system constructs a synthetic taxonomy by alternating between three steps: Best-of-N proposal of child nodes given context, separate critique-only calls that exploit the generator-critic gap in LLMs (models tend to judge candidates more reliably than they generate them), and level-completion planning to ensure consistent granularity across siblings. The resulting taxonomy provides granular, explainable control: every dataset characteristic maps to a tree node, so users can see and adjust what is covered.
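The three-step loop can be sketched as follows. This is a minimal illustration, not Simula's implementation: the `propose`, `critique`, and `plan` callables are hypothetical stand-ins for LLM calls, and modelling level-completion planning as a function that normalizes a sibling list is an assumption made for this example. The toy stubs only exist so the sketch runs without a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

# Hypothetical signatures standing in for LLM calls (assumptions, not Simula's API).
Proposer = Callable[[str, int], List[List[str]]]  # (parent path, n) -> n candidate child lists
Critic = Callable[[str, List[str]], float]        # (parent path, candidate) -> score
Planner = Callable[[List[str]], List[str]]        # sibling list -> consistent sibling list

def expand_level(path: str, propose: Proposer, critique: Critic,
                 plan: Planner, n: int = 4) -> List[str]:
    candidates = propose(path, n)                             # Best-of-N proposal
    best = max(candidates, key=lambda c: critique(path, c))   # separate critique-only scoring
    return plan(best)                                         # level completion across siblings

def build_taxonomy(root_name: str, depth: int, propose: Proposer,
                   critique: Critic, plan: Planner, n: int = 4) -> Node:
    root = Node(root_name)
    frontier = [(root, root_name)]
    for _ in range(depth):
        next_frontier = []
        for node, path in frontier:
            for child_name in expand_level(path, propose, critique, plan, n):
                child = Node(child_name)
                node.children.append(child)
                next_frontier.append((child, f"{path}/{child_name}"))
        frontier = next_frontier
    return root

# Toy stubs so the sketch runs without a model.
def toy_propose(path: str, n: int) -> List[List[str]]:
    leaf = path.split("/")[-1]
    return [[f"{leaf}.{j}" for j in range(i + 1)] for i in range(n)]

def toy_critique(path: str, candidate: List[str]) -> float:
    return -abs(len(candidate) - 2)  # this toy critic prefers exactly two children

def toy_plan(siblings: List[str]) -> List[str]:
    return sorted(set(siblings))     # de-duplicate and order the siblings

root = build_taxonomy("story", 2, toy_propose, toy_critique, toy_plan)
print([child.name for child in root.children])
```

Keeping the critic as a separate callable mirrors the note's point that critique happens in its own calls rather than inside the proposal step.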

For local diversity and complexity, Simula uses agentic refinement after taxonomic sampling. Sampling strategies define which sub-taxonomies combine sensibly (a horror novel about a troubled cat for toddlers should be filtered out), and "semantic expansion" generates multiple meta-prompts simultaneously to mitigate mode collapse when the requested sample count exceeds the number of unique node pairs. Quality control uses pointwise critique with binary verdicts and, for tasks with defined correctness, double-critic rejection sampling, which mitigates sycophancy bias.
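A rough sketch of the sampling filter and the rejection step described above, under stated assumptions: `compatible`, `generate`, `critic_a`, and `critic_b` are hypothetical callables standing in for LLM calls, and "double-critic" is read here as two independent binary verdicts that must both accept. The note does not specify the exact protocol, so treat this as one plausible reading.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional

Combo = Dict[str, str]

def sample_combinations(axes: Dict[str, List[str]],
                        compatible: Callable[[Combo], bool],
                        k: int, rng: random.Random) -> List[Combo]:
    """Draw k node combinations across sub-taxonomies, skipping incompatible ones."""
    names = sorted(axes)
    combos = [dict(zip(names, values))
              for values in itertools.product(*(axes[name] for name in names))]
    valid = [combo for combo in combos if compatible(combo)]
    if not valid:
        raise ValueError("no compatible combinations to sample from")
    # Sampling with replacement: k may exceed the number of unique combinations,
    # which is the situation where the note says semantic expansion is needed.
    return [rng.choice(valid) for _ in range(k)]

def rejection_sample(generate: Callable[[], str],
                     critic_a: Callable[[str], bool],
                     critic_b: Callable[[str], bool],
                     max_tries: int = 10) -> Optional[str]:
    """Return the first sample both critics accept, or None if none passes."""
    for _ in range(max_tries):
        sample = generate()
        if critic_a(sample) and critic_b(sample):  # both binary verdicts must accept
            return sample
    return None

# Toy usage with hand-written rules in place of model calls.
axes = {"genre": ["horror", "bedtime"], "audience": ["toddlers", "adults"]}

def no_horror_for_toddlers(combo: Combo) -> bool:
    return not (combo["genre"] == "horror" and combo["audience"] == "toddlers")

picks = sample_combinations(axes, no_horror_for_toddlers, 5, random.Random(0))
drafts = iter(["2+2=5", "2+2=4"])
accepted = rejection_sample(lambda: next(drafts),
                            lambda s: s.endswith("4"),
                            lambda s: "=" in s)
print(picks[0], accepted)
```

Requiring agreement from two separate verdicts is the design choice that guards against a single critic simply endorsing whatever it is shown.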

The architectural insight is that "good" synthetic data has irreducibly multiple desiderata (quality, diversity, and complexity), and that previous approaches optimized only subsets because they used a single mechanism. This is exactly the problem diagnosed in the related note "How do quality, diversity, and complexity affect synthetic data differently?". Decomposing coverage (global) from variation (local) and using a different mechanism for each makes all three controllable simultaneously. This unlocks data generation for domains where seed data does not exist, which are precisely the domains where synthetic data is most needed (medicine, finance, law).

Original note title

seedless synthetic data generation through taxonomic decomposition replaces seed-data dependence with explainable global coverage control