Can LLMs efficiently generate taxonomies and label training data?

Explores whether large language models can automate both taxonomy generation and data labeling to reduce the manual effort and domain expertise traditionally required for text mining tasks.

Synthesis note · 2026-06-03 · sourced from Work Application Use Cases

Text mining couples two interrelated tasks: taxonomy generation (finding and organizing canonical labels for a corpus) and text classification (assigning those labels to instances). Both traditionally rely on expensive domain expertise and manual curation, an approach that breaks down when the label space is under-specified and annotations are unavailable. TnT-LLM automates both tasks end-to-end with LLMs in two phases. Phase 1: a zero-shot, multi-stage reasoning approach has the LLM produce and iteratively refine a label taxonomy. Phase 2: LLMs act as data labelers, generating pseudo-labels that are used to train lightweight supervised classifiers, which can be deployed and served cheaply at scale.
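The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the TnT-LLM implementation: the prompt format, the batch-wise refinement loop, the naive Bayes classifier, and every name below (`build_prompt`, `generate_taxonomy`, `pseudo_label`, `toy_llm`, `KEYWORDS`) are hypothetical. `toy_llm` is a keyword-matching stand-in so the sketch runs offline; a real pipeline would replace it with an actual model call.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out


def build_prompt(task: str, taxonomy: List[str], docs: List[str]) -> str:
    # Assumed prompt layout; the source does not specify one.
    return "\n".join([f"TASK: {task}", "LABELS: " + ", ".join(taxonomy), "DOCS:"] + docs)


def generate_taxonomy(llm: LLM, corpus: List[str], batch_size: int = 2) -> List[str]:
    """Phase 1: propose labels on the first batch, refine them on each later batch."""
    taxonomy: List[str] = []
    for start in range(0, len(corpus), batch_size):
        reply = llm(build_prompt("taxonomy", taxonomy, corpus[start:start + batch_size]))
        taxonomy = [label.strip() for label in reply.split(",") if label.strip()]
    return taxonomy


def pseudo_label(llm: LLM, taxonomy: List[str], docs: List[str]) -> List[Tuple[str, str]]:
    """Phase 2a: the LLM labels each document; off-taxonomy answers are dropped."""
    pairs = []
    for doc in docs:
        answer = llm(build_prompt("label", taxonomy, [doc])).strip()
        if answer in taxonomy:
            pairs.append((doc, answer))
    return pairs


class NaiveBayes:
    """Phase 2b: a lightweight classifier trained on the pseudo-labels."""

    def fit(self, docs: List[str], labels: List[str]) -> "NaiveBayes":
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc: str) -> str:
        tokens = doc.lower().split()
        total = sum(self.class_counts.values())

        def score(label: str) -> float:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            log_prob = math.log(self.class_counts[label] / total)
            return log_prob + sum(math.log((counts[t] + 1) / denom) for t in tokens)

        return max(self.class_counts, key=score)


# Toy stand-in for a real LLM (assumption): matches keywords instead of reasoning.
KEYWORDS = {"billing": ["invoice", "refund"], "login": ["password", "sign"]}


def toy_llm(prompt: str) -> str:
    lines = prompt.split("\n")
    task = lines[0]
    current = [l for l in lines[1][len("LABELS: "):].split(", ") if l]
    docs_text = "\n".join(lines[3:]).lower()
    hits = [lab for lab, words in KEYWORDS.items() if any(w in docs_text for w in words)]
    if task == "TASK: taxonomy":
        return ", ".join(current + [h for h in hits if h not in current])
    return hits[0] if hits else "unknown"


corpus = [
    "refund for my invoice please",
    "invoice shows wrong amount",
    "forgot my password again",
    "cannot sign in with password",
]
taxonomy = generate_taxonomy(toy_llm, corpus)
pairs = pseudo_label(toy_llm, taxonomy, corpus)
clf = NaiveBayes().fit([d for d, _ in pairs], [l for _, l in pairs])
prediction = clf.predict("password reset not working")
```

After training, only `clf.predict` runs at serving time, so the LLM is called once per batch during taxonomy refinement and once per training document, and never per served document.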

The key takeaway is the division of labor: use the expensive LLM for the parts that need open-ended reasoning (inventing and refining the taxonomy, producing training labels), then distill the result into a cheap classifier for high-volume serving. This yields LLM-quality structure without LLM-cost inference, and it lowers the barrier to text mining when the label space is under-specified.
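The cost argument can be made concrete with a back-of-the-envelope model. The sketch below is an assumption-laden simplification: it ignores the cost of taxonomy generation and classifier training, treats per-item costs as constant, and all function and parameter names are hypothetical. It shows only that distillation pays off once the serving volume exceeds a break-even point set by the size of the pseudo-labeled training set.

```python
def llm_only_cost(n_served: float, llm_cost_per_item: float) -> float:
    """Cost of classifying every served item directly with the LLM."""
    return n_served * llm_cost_per_item


def distilled_cost(n_served: float, n_labeled: float,
                   llm_cost_per_item: float, clf_cost_per_item: float) -> float:
    """LLM labels n_labeled training items once; the cheap classifier serves the rest."""
    return n_labeled * llm_cost_per_item + n_served * clf_cost_per_item


def break_even_volume(n_labeled: float, llm_cost_per_item: float,
                      clf_cost_per_item: float) -> float:
    """Serving volume at which both strategies cost the same."""
    if clf_cost_per_item >= llm_cost_per_item:
        raise ValueError("distillation never pays off if the classifier is not cheaper")
    return n_labeled * llm_cost_per_item / (llm_cost_per_item - clf_cost_per_item)
```

Above the break-even volume, every additional served item saves the difference between the two per-item costs.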

This is methodologically relevant to Adrian's own vault pipeline (taxonomy/topic induction plus labeling). It rhymes with "Can smaller models handle RAG filtering while larger models focus on synthesis?" in its tiered use of model capability (big model for structure, small model for scale), and with the taxonomy-induction spirit of synthetic-data work such as "Can we generate synthetic data without any seed examples?".

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Does model scaling alone produce compositional generalization without symbolic mechanisms?

What makes hierarchical reasoning effective for taxonomy induction?

How do language models inherit human biases from training data?

Does pseudo-labeling from LLMs degrade classifier performance?

Can alternative training methods improve on supervised fine-tuning for language models?

Can information-gain principles improve how we choose what to label?

How should human oversight be integrated with autonomous AI systems?

How do closed-loop automated venues differ from human-in-the-loop review taxonomies?

Related concepts in this collection 2

Can smaller models handle RAG filtering while larger models focus on synthesis?
Does splitting RAG pipeline work between cheaper small models and expensive large models improve both cost and quality? The question asks whether different pipeline stages have different optimal model sizes.
Shared tiered-capability pattern: expensive model for hard parts, cheap model at scale.

Can we generate synthetic data without any seed examples?
Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?
Both use LLMs to build a taxonomy as scaffolding for downstream generation/labeling.
