Can we generate synthetic data without any seed examples?

Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?

Synthesis note · 2026-05-03 · sourced from Data

Existing synthetic data generation methods generally fall into two categories: prompt-engineered approaches, which generalize poorly because they require manual customization for each task, and stochastic evolutionary algorithms, which lack explainability and control. Both typically require seed examples drawn from the target distribution, which is unrealistic for genuinely novel domains and may hurt global coverage by anchoring generation to existing examples.

Simula proposes a different decomposition: separate global coverage from local diversity, and address each through a different mechanism. For global coverage, the system constructs a synthetic taxonomy by alternating between three steps: Best-of-N proposal of child nodes given context, separate critique-only calls that exploit the generator-critic gap in LLMs (models are often better at judging candidates than at producing them), and level-completion planning to ensure consistent granularity across siblings. The resulting taxonomy provides granular, explainable control: every dataset characteristic maps to a tree node, so users can see and adjust what is covered.
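The three-step loop can be sketched as follows. This is a minimal illustration, not Simula's implementation: `Node`, `best_of_n`, `build_taxonomy`, and the toy `toy_propose` / `toy_critique` stand-ins for LLM calls are all hypothetical names, and level-completion planning is approximated here by simply finishing every node on a level before descending.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)

# Hypothetical signatures standing in for LLM calls.
Proposer = Callable[[List[str], int], List[str]]  # (path, attempt) -> child names
Critic = Callable[[List[str], List[str]], float]  # (path, children) -> score

def best_of_n(path: List[str], propose: Proposer, critique: Critic, n: int) -> List[str]:
    """Draw n proposals and keep the one the separate critic scores highest."""
    candidates = [propose(path, attempt) for attempt in range(n)]
    return max(candidates, key=lambda children: critique(path, children))

def build_taxonomy(root_name: str, propose: Proposer, critique: Critic,
                   depth: int, n: int = 3) -> Node:
    """Expand the tree one whole level at a time (assumed simplification)."""
    root = Node(root_name)
    frontier = [(root, [root_name])]
    for _ in range(depth):
        next_frontier = []
        for node, path in frontier:  # complete this level before going deeper
            for child_name in best_of_n(path, propose, critique, n):
                child = Node(child_name)
                node.children.append(child)
                next_frontier.append((child, path + [child_name]))
        frontier = next_frontier
    return root

def toy_propose(path: List[str], attempt: int) -> List[str]:
    # Toy proposer: attempt k returns k + 1 children under the current node.
    return [f"{path[-1]}.{i}" for i in range(attempt + 1)]

def toy_critique(path: List[str], children: List[str]) -> float:
    # Toy critic: prefers exactly two children per node.
    return -abs(len(children) - 2)

tree = build_taxonomy("root", toy_propose, toy_critique, depth=2)
```

Keeping the proposer and critic as separate callables mirrors the note's point that critique happens in its own calls; in a real system both would wrap model requests rather than the toy rules used here.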

For local diversity and complexity, Simula uses agentic refinement after taxonomic sampling. Sampling strategies define which sub-taxonomies combine sensibly (a horror novel about a troubled cat for toddlers should be filtered out), and "semantic expansion" generates multiple meta-prompts simultaneously to mitigate mode collapse when the requested sample count exceeds the number of unique node pairs. Quality control uses pointwise critique with binary verdicts, plus double-critic rejection sampling for tasks with defined correctness, which mitigates sycophancy bias.
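A small sketch may make the sampling filter and the rejection step concrete. The note does not say how the two critics are combined, so this example assumes a candidate is kept only when both independent critics return an accepting binary verdict. All names (`compatible_combos`, `rejection_sample`, the toy filter, generator, and critics) are hypothetical and stand in for model calls.

```python
import itertools
from typing import Callable, Dict, List, Optional

Combo = Dict[str, str]

def compatible_combos(axes: Dict[str, List[str]],
                      is_compatible: Callable[[Combo], bool]) -> List[Combo]:
    """Pick one leaf per taxonomy axis and drop combinations the filter rejects."""
    names = list(axes)
    combos = []
    for values in itertools.product(*(axes[name] for name in names)):
        combo = dict(zip(names, values))
        if is_compatible(combo):
            combos.append(combo)
    return combos

def rejection_sample(generate: Callable[[Combo, int], str],
                     critic_a: Callable[[str], bool],
                     critic_b: Callable[[str], bool],
                     combo: Combo,
                     max_tries: int = 5) -> Optional[str]:
    """Return the first candidate both critics accept, or None if none pass."""
    for attempt in range(max_tries):
        candidate = generate(combo, attempt)
        # Assumed rule: both binary verdicts must be positive.
        if critic_a(candidate) and critic_b(candidate):
            return candidate
    return None

axes = {"genre": ["horror", "fable"], "audience": ["toddlers", "adults"]}

def toy_filter(combo: Combo) -> bool:
    # Hypothetical rule mirroring the note's example: no horror for toddlers.
    return not (combo["genre"] == "horror" and combo["audience"] == "toddlers")

def toy_generate(combo: Combo, attempt: int) -> str:
    return f"{combo['genre']} story for {combo['audience']} (draft {attempt})"

def toy_critic_a(text: str) -> bool:
    return "draft 0" not in text  # toy critic that always rejects the first draft

def toy_critic_b(text: str) -> bool:
    return len(text) > 10  # toy critic with a trivial length check

combos = compatible_combos(axes, toy_filter)
sample = rejection_sample(toy_generate, toy_critic_a, toy_critic_b, combos[0])
```

Requiring agreement from two critics trades yield for precision: more candidates are discarded, but a single over-agreeable verdict cannot admit a sample on its own.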

The architectural insight is that "good" synthetic data has irreducibly multiple desiderata (quality, diversity, complexity) and that previous approaches optimized only subsets because they used a single mechanism, exactly the problem that the note "How do quality, diversity, and complexity affect synthetic data differently?" diagnoses. Decomposing coverage (global) from variation (local) and using different mechanisms for each makes all three controllable simultaneously. This unlocks data generation for domains where seed data does not exist, which are precisely the domains where synthetic data is most needed (medicine, finance, law).

Inquiring lines that read this note 17

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How can humans calibrate appropriate trust in AI systems?

Why can't humans reliably detect AI-generated text despite measurable linguistic signatures?

How does treating synthetic data as empirical evidence contaminate statistical inference?

How can AI alignment serve diverse human preferences at scale?

What quality of curated data is minimally sufficient for alignment?

What are the consequences of models training on synthetic data?

How do training priors constrain what context information can override?

How do label constraints improve synthetic data without ground truth validation?

Does model scaling alone produce compositional generalization without symbolic mechanisms?

What sampling strategies prevent nonsensical combinations when composing taxonomy nodes?

When does optimizing for quality undermine the value of diversity?

At what point does output quality outweigh diversity value in synthetic data tasks?

What dimensions of recommendation quality do standard metrics miss?

Why is evaluating synthetic data quality so ambiguous and context-dependent?

Related concepts in this collection 5

How do quality, diversity, and complexity affect synthetic data differently? When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
extends: Simula is a concrete architecture for the QDC framework's prescriptions — different mechanisms per desideratum
Can synthetic data replace seed examples in task generation? Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.
extends: TarGEN replaces input-output exemplars with instance seeds; Simula replaces instance seeds with taxonomies — three-step progression away from seed dependence
Can synthetic dialogues become realistic through layered diversity? Explores whether combining persona variation, subtopic specificity, and contextual grounding can generate synthetic dialogues that match real conversational data quality and capture the full spectrum of dialogue diversity.
exemplifies: dialogue-domain instance of the same global-vs-local decomposition Simula generalizes
Should persona simulation prioritize coverage over statistical matching? Explores whether stress-testing AI systems requires spanning rare user configurations rather than replicating aggregate population statistics. Critical for identifying edge-case failures.
complements: same coverage-vs-density distinction at the persona-generation level
Do different AI models actually produce diverse outputs? Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
tension: Simula's mode-collapse mitigation via semantic expansion targets exactly the hivemind tendency, but coverage-by-taxonomy still depends on the generator's own taxonomic intuitions
