Can LLMs efficiently generate taxonomies and label training data?
Explores whether large language models can automate both taxonomy generation and data labeling to reduce the manual effort and domain expertise traditionally required for text mining tasks.
Text mining couples two interrelated tasks — taxonomy generation (finding and organizing canonical labels for a corpus) and text classification (labeling instances) — and both traditionally rely on expensive domain expertise and manual curation, which breaks when the label space is under-specified and annotations are unavailable. TnT-LLM automates both end-to-end with LLMs in two phases. Phase 1: a zero-shot, multi-stage reasoning approach has the LLM produce and iteratively refine a label taxonomy. Phase 2: LLMs act as data labelers generating pseudo-labels, which train lightweight supervised classifiers that can be deployed and served cheaply at scale.
The keeper is the division of labor: use the expensive LLM for the parts that need open-ended reasoning (inventing and refining the taxonomy, producing training labels), then distill into a cheap classifier for high-volume serving. The payoff is LLM-quality structure without LLM-cost inference, which opens up text mining for under-specified label spaces where neither a taxonomy nor annotations exist yet.
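The two phases can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the TnT-LLM implementation: the `LLM` callable, the prompt wording, `fake_llm` (an offline keyword stub standing in for a real model call), and the pure-Python naive Bayes classifier are all hypothetical stand-ins chosen so the example runs without dependencies. The multi-stage reasoning of Phase 1 is reduced here to a single refine-per-batch loop.

```python
import math
from collections import Counter, defaultdict
from typing import Callable, List, Tuple

LLM = Callable[[str], str]  # hypothetical interface: prompt in, completion out

def generate_taxonomy(docs: List[str], llm: LLM, batch_size: int = 2) -> List[str]:
    """Phase 1 (simplified): propose labels on the first batch, refine on each later one."""
    taxonomy: List[str] = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        prompt = (
            "TASK: taxonomy\n"
            f"Current labels: {', '.join(taxonomy)}\n"
            "Revise the label list to cover the texts; reply with comma-separated labels.\n"
            "Texts:\n" + "\n".join(batch)
        )
        taxonomy = [label.strip() for label in llm(prompt).split(",") if label.strip()]
    return taxonomy

def pseudo_label(docs: List[str], taxonomy: List[str], llm: LLM) -> List[Tuple[str, str]]:
    """Phase 2a: the LLM assigns one taxonomy label per document."""
    pairs = []
    for doc in docs:
        answer = llm(f"TASK: label\nLabels: {', '.join(taxonomy)}\nText: {doc}").strip()
        if answer in taxonomy:  # discard answers that fall outside the taxonomy
            pairs.append((doc, answer))
    return pairs

class NaiveBayesClassifier:
    """Phase 2b: a cheap bag-of-words classifier trained on the pseudo-labels."""

    def fit(self, pairs: List[Tuple[str, str]]) -> "NaiveBayesClassifier":
        self.label_counts = Counter(label for _, label in pairs)
        self.word_counts = defaultdict(Counter)
        for doc, label in pairs:
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc: str) -> str:
        total = sum(self.label_counts.values())

        def log_score(label: str) -> float:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # Laplace smoothing
            score = math.log(self.label_counts[label] / total)
            for word in doc.lower().split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            return score

        return max(self.label_counts, key=log_score)

# Offline stub for the LLM (assumption: a real system would call a model here).
KEYWORDS = {"billing": "refund", "shipping": "delivery"}

def fake_llm(prompt: str) -> str:
    text = prompt.lower()
    if prompt.startswith("TASK: taxonomy"):
        # Keep labels already listed in the prompt and add any newly triggered ones.
        return ", ".join(n for n, kw in KEYWORDS.items() if n in text or kw in text)
    doc = text.rsplit("text: ", 1)[1]
    return next((n for n, kw in KEYWORDS.items() if kw in doc), "unknown")

DOCS = [
    "I want a refund for my order",
    "refund not received yet",
    "delivery was late again",
    "where is my delivery",
]

taxonomy = generate_taxonomy(DOCS, fake_llm)              # expensive model, run once
training_pairs = pseudo_label(DOCS, taxonomy, fake_llm)   # expensive model, run once
classifier = NaiveBayesClassifier().fit(training_pairs)   # cheap model, served at scale
print(taxonomy, classifier.predict("refund please"))
```

The point of the shape is that `fake_llm` is called only while building the taxonomy and the training set; every later prediction goes through `classifier.predict`, which never touches the LLM.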
This is methodologically relevant to Adrian's own vault pipeline (taxonomy/topic induction + labeling). It rhymes with "Can smaller models handle RAG filtering while larger models focus on synthesis?" in its tiered use of model capability (big model for structure, small for scale), and with the taxonomy-induction spirit of synthetic-data work like "Can we generate synthetic data without any seed examples?"
Related concepts in this collection 2
- Can smaller models handle RAG filtering while larger models focus on synthesis?
  Does splitting RAG pipeline work between cheaper small models and expensive large models improve both cost and quality? The question asks whether different pipeline stages have different optimal model sizes.
  Connection: shared tiered-capability pattern (expensive model for hard parts, cheap model at scale).
- Can we generate synthetic data without any seed examples?
  Existing synthetic data methods rely on seed examples from the target distribution, which is impractical for novel domains. Can taxonomic decomposition eliminate this dependence while maintaining controllable coverage?
  Connection: both use LLMs to build a taxonomy as scaffolding for downstream generation/labeling.
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- TnT-LLM: Text Mining at Scale with Large Language Models
- Using Large Language Models to Generate, Validate, and Apply User Intent Taxonomies
- A Survey on Post-training of Large Language Models
- TarGEN: Targeted Data Generation with Large Language Models
- Linguistic Blind Spots of Large Language Models
- DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents
- The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
- Long-context LLMs Struggle with Long In-context Learning
Original note title
LLMs can generate a label taxonomy then label data to train lightweight classifiers — automating text mining at scale