Can we generate synthetic data without any seed examples?
Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?
Existing synthetic data generation methods generally fall into two categories: prompt-engineered approaches, which generalize poorly because they require manual customization for each task, and stochastic evolutionary algorithms, which lack explainability and control. Both typically require seed examples drawn from the target distribution, which is unrealistic for genuinely novel domains and may hurt global coverage by anchoring generation to existing examples.
Simula proposes a different decomposition: separate global coverage from local diversity, and address each through a different mechanism. For global coverage, the system constructs a synthetic taxonomy by alternating between three steps: Best-of-N proposal of child nodes given context, separate critique-only calls that exploit the generator-critic gap in LLMs, and level-completion planning to ensure consistent granularity across siblings. The resulting taxonomy provides granular, explainable control: every dataset characteristic maps to a tree node, so users can see and adjust what is covered.
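The three alternating steps can be sketched as a small recursive loop. This is a minimal illustration, not Simula's implementation: `Node`, `expand`, `leaves`, and the three `toy_*` callables are hypothetical names, and the toy functions are deterministic stand-ins for what would be LLM calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def expand(node, context, depth, propose, critique, plan, n=3):
    """Grow the taxonomy below `node` for `depth` more levels."""
    if depth == 0:
        return
    # Step 1: Best-of-N proposal, n candidate child lists for this context.
    candidates = propose(context, n)
    # Step 2: a separate critique call scores each candidate; keep the best.
    best = max(candidates, key=lambda c: critique(context, c))
    # Step 3: level-completion planning harmonises the chosen siblings.
    node.children = [Node(name) for name in plan(best)]
    for child in node.children:
        expand(child, f"{context} > {child.name}", depth - 1,
               propose, critique, plan, n)


def leaves(node):
    """Return leaf names in left-to-right order."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaves(child)]


# Toy stand-ins (assumptions): real versions would prompt an LLM.
def toy_propose(context, n):
    leaf = context.split(" > ")[-1]
    # Candidate i contains i + 1 children.
    return [[f"{leaf}.{j}" for j in range(i + 1)] for i in range(n)]


def toy_critique(context, candidate):
    # Prefers candidates with exactly two children.
    return -abs(len(candidate) - 2)


def toy_plan(names):
    # Deduplicate and order siblings; a placeholder for granularity checks.
    return sorted(set(names))


root = Node("stories")
expand(root, "stories", 2, toy_propose, toy_critique, toy_plan)
print(leaves(root))
```

Keeping the proposer and the critic as separate callables mirrors the point in the text that critique happens in its own call rather than inside the generation prompt.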
For local diversity and complexity, Simula uses agentic refinement after taxonomic sampling. Sampling strategies define which sub-taxonomies combine sensibly (a horror novel about a troubled cat for toddlers should be filtered out), and "semantic expansion" generates multiple meta-prompts simultaneously to mitigate mode collapse when the requested sample count exceeds the number of unique node pairs. Quality control happens through pointwise critique with binary verdicts, plus double-critic rejection sampling for tasks with defined correctness, which mitigates sycophancy bias.
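A rough sketch of two of these pieces, compatibility-filtered sampling and double-critic rejection, is given below. All names are hypothetical. The note does not spell out how the two critics are framed, so the sketch assumes one critic is asked "is this correct?" and the other "is this incorrect?", and a sample is kept only when the first says yes and the second says no. The toy critics are deterministic stand-ins for LLM calls.

```python
import itertools
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def compatible_pairs(genres: List[str], audiences: List[str],
                     is_compatible: Callable[[str, str], bool]) -> List[Pair]:
    """Cross two sub-taxonomies and drop combinations flagged as nonsensical."""
    return [(g, a) for g, a in itertools.product(genres, audiences)
            if is_compatible(g, a)]


def double_critic_filter(candidates, says_correct, says_incorrect):
    """Keep a candidate only if it is affirmed AND not flagged (assumed framing)."""
    return [c for c in candidates if says_correct(c) and not says_incorrect(c)]


# Toy compatibility rule taken from the example in the text.
def toy_compatible(genre, audience):
    return not (genre == "horror" and audience == "toddlers")


# Toy critics (assumptions): the first is sycophantic and approves everything,
# the second actually checks the arithmetic when asked the opposite question.
def toy_says_correct(item):
    return True


def toy_says_incorrect(item):
    question, answer = item
    return sum(int(x) for x in question.split("+")) != int(answer)


pairs = compatible_pairs(["horror", "bedtime"], ["toddlers", "adults"],
                         toy_compatible)
kept = double_critic_filter([("2+2", "4"), ("2+3", "6"), ("1+1", "2")],
                            toy_says_correct, toy_says_incorrect)
print(pairs)
print(kept)
```

In this toy setup the sycophantic first critic would accept the wrong answer on its own; the oppositely framed second verdict is what rejects it.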
The architectural insight is that "good" synthetic data has irreducibly multiple desiderata (quality, diversity, complexity) and that previous approaches optimized only subsets because they used a single mechanism, which is exactly the problem diagnosed in "How do quality, diversity, and complexity affect synthetic data differently?". Decomposing coverage (global) from variation (local) and using a different mechanism for each makes all three controllable simultaneously. This unlocks data generation for domains where seed data does not exist, which are precisely the domains where synthetic data is most needed (medicine, finance, law).
Inquiring lines that use this note as a source 17
This note is a source for these synthesized inquiries.
- What would it mean to assign explicit trust weights to synthetic data?
- How does treating synthetic data as empirical evidence contaminate statistical inference?
- What quality of curated data is minimally sufficient for alignment?
- What role should the trust parameter play in using synthetic data as evidence?
- Can synthetic data preserve the diversity needed for transcendence to work?
- What distinguishes instance seeds from full input-output exemplar requirements?
- How do label constraints improve synthetic data without ground truth validation?
- Can synthetic data generation balance all three QDC axes simultaneously?
- How does diversity loss in synthetic data mirror tail distribution disappearance?
- Why does separating global coverage from local variation improve synthetic data generation?
- What sampling strategies prevent nonsensical combinations when composing taxonomy nodes?
- Can deterministic computation actually create new information in data?
- At what point does output quality outweigh diversity value in synthetic data tasks?
- Can synthetic data generation work without seed examples?
- Why is evaluating synthetic data quality so ambiguous and context-dependent?
- Can seedless generation maintain explainability while scaling control?
- What makes seed data a bottleneck in synthetic generation pipelines?
Related concepts in this collection 5
- How do quality, diversity, and complexity affect synthetic data differently?
  When training models on synthetic data, do quality, diversity, and complexity each play distinct roles in how well models generalize? Understanding their separate effects could explain why current optimization strategies fail.
  extends: Simula is a concrete architecture for the QDC framework's prescriptions (different mechanisms per desideratum)
- Can synthetic data replace seed examples in task generation?
  Can models generate high-quality synthetic data for novel tasks without relying on existing input-output exemplars? This matters because many specialized domains lack training examples to work from.
  extends: TarGEN replaces input-output exemplars with instance seeds; Simula replaces instance seeds with taxonomies, a three-step progression away from seed dependence
- Can synthetic dialogues become realistic through layered diversity?
  Explores whether combining persona variation, subtopic specificity, and contextual grounding can generate synthetic dialogues that match real conversational data quality and capture the full spectrum of dialogue diversity.
  exemplifies: dialogue-domain instance of the same global-vs-local decomposition Simula generalizes
- Should persona simulation prioritize coverage over statistical matching?
  Explores whether stress-testing AI systems requires spanning rare user configurations rather than replicating aggregate population statistics. Critical for identifying edge-case failures.
  complements: same coverage-vs-density distinction at the persona-generation level
- Do different AI models actually produce diverse outputs?
  Explores whether using multiple different language models together creates genuine diversity or whether shared training and alignment cause them to converge on similar answers despite independence.
  tension: Simula's mode-collapse mitigation via semantic expansion targets exactly the hivemind tendency, but coverage-by-taxonomy still depends on the generator's own taxonomic intuitions
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Orchestrating Synthetic Data with Reasoning
- Reasoning-Driven Synthetic Data Generation and Evaluation
- ToolFlow: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis
- Scaling Synthetic Data Creation with 1,000,000,000 Personas
- Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models
- A Little Human Data Goes A Long Way
- CoT-Self-Instruct: Building high-quality synthetic prompts for reasoning and non-reasoning tasks
- TarGEN: Targeted Data Generation with Large Language Models
Original note title
seedless synthetic data generation through taxonomic decomposition replaces seed-data dependence with explainable global coverage control