INQUIRING LINE

What makes intent taxonomies unmanageable at hundreds of intents?

This explores why classifying user requests into a fixed list of named intents breaks down as that list grows — and what the corpus offers as alternatives.


This reads the question as being about the classic dialogue-system design in which every user utterance must be sorted into one of a predefined set of 'intents', and about why that design buckles once the set reaches hundreds. The corpus has one sharp, direct answer and several lateral ones that explain the deeper reason the design was never going to scale.

The most on-the-nose material comes from Rasa's reframing of dialogue understanding as command generation rather than intent classification (see "Can command generation replace intent classification in dialogue systems?"). The argument there is that intent classification is the wrong primitive: every new intent demands fresh annotated examples, the categories start overlapping at the edges (is this a 'reschedule' or a 'cancel-and-rebook'?), and accuracy degrades as the label space grows. Generating a domain-specific command instead of picking from a flat menu sidesteps all three: no annotation burden, context handled naturally, and scaling without the degradation. The taxonomy isn't unmanageable because hundreds is a big number; it's unmanageable because the format forces you to carve continuous, context-dependent meaning into discrete, mutually exclusive boxes.
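To make the contrast concrete, here is a minimal sketch of what "generate commands instead of picking a label" could look like. The command names (`StartFlow`, `SetSlot`, `CancelFlow`), the flow and slot names, and the one-command-per-line text format are all assumptions for illustration, not Rasa's actual API; the generator itself is replaced by a hard-coded string.

```python
import re
from dataclasses import dataclass

# Hypothetical command types for illustration only; not Rasa's API.
@dataclass(frozen=True)
class StartFlow:
    flow: str

@dataclass(frozen=True)
class SetSlot:
    name: str
    value: str

@dataclass(frozen=True)
class CancelFlow:
    pass

# Matches lines shaped like Name(arg1, arg2).
_COMMAND = re.compile(r"^(\w+)\((.*)\)$")

def parse_commands(generated: str):
    """Parse one command per line, skipping anything malformed or unknown."""
    commands = []
    for line in generated.strip().splitlines():
        match = _COMMAND.match(line.strip())
        if match is None:
            continue
        name, raw_args = match.groups()
        args = [a.strip() for a in raw_args.split(",")] if raw_args.strip() else []
        if name == "StartFlow" and len(args) == 1:
            commands.append(StartFlow(args[0]))
        elif name == "SetSlot" and len(args) == 2:
            commands.append(SetSlot(args[0], args[1]))
        elif name == "CancelFlow" and not args:
            commands.append(CancelFlow())
    return commands

# Stand-in for generator output on "cancel Friday and rebook for Monday".
generated = """
CancelFlow()
StartFlow(book_appointment)
SetSlot(date, monday)
"""
commands = parse_commands(generated)
print(commands)
```

In this sketch the 'reschedule' versus 'cancel-and-rebook' ambiguity never has to be resolved into a single label: the utterance maps to a short sequence of commands, and the set of expressible requests grows by composition rather than by adding rows to a flat menu.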

The most interesting lateral comes from retrieval failure analysis (see "Where do retrieval systems fail and why?"), which names a hard ceiling: embedding dimension mathematically constrains which sets of items a vector space can represent, and embeddings measure association rather than relevance. Map that onto intents and you get a structural reason for the plateau: if intents are matched through embeddings, then past some point two intents simply cannot be reliably distinguished in the representation, no matter how much you tune the classifier. This is plausibly the same wall, surfacing in a different subfield.
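The flavor of that ceiling can be seen in a deliberately tiny toy case. The sketch below is an assumption-laden illustration, not the analysis from the cited note: it places 8 items as unit vectors in 2 dimensions, scores them by dot product, sweeps query directions around the circle, and records which pairs of items ever come out as the top two. With unit vectors on a circle, the top two are always angular neighbors, so at most 8 of the 28 possible pairs are reachable by any query.

```python
import math
import random
from itertools import combinations

def achievable_top2_sets(doc_angles, n_queries=3600):
    """Collect every top-2 item pair that some query direction retrieves."""
    docs = [(math.cos(a), math.sin(a)) for a in doc_angles]
    achieved = set()
    for i in range(n_queries):
        theta = 2 * math.pi * i / n_queries
        q = (math.cos(theta), math.sin(theta))
        # Dot-product score of the query against every item.
        scores = [q[0] * x + q[1] * y for x, y in docs]
        top2 = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)[:2]
        achieved.add(frozenset(top2))
    return achieved

random.seed(0)
n_docs = 8
doc_angles = [random.uniform(0, 2 * math.pi) for _ in range(n_docs)]

achieved = achievable_top2_sets(doc_angles)
all_pairs = {frozenset(p) for p in combinations(range(n_docs), 2)}
unreachable = all_pairs - achieved
print(len(achieved), "of", len(all_pairs), "pairs are retrievable")
```

Real embedding spaces have hundreds or thousands of dimensions, so the numbers are far more generous, but the shape of the constraint is the same: for a fixed dimension, some combinations of items cannot be jointly ranked on top, and no amount of tuning on the scoring side changes that.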

There's also a cognitive-load echo in argument-scheme classification (see "Why does argument scheme classification stumble where other NLP tasks succeed?"), where models stall at F1 0.55–0.65 on a task requiring integrative pattern recognition while the same systems sail past 0.80 on component tagging and stance. Fine-grained intent disambiguation is arguably the same kind of integrative task, which would explain why adding more classes hits a quality cliff rather than a gentle slope.

A less obvious takeaway: the corpus suggests the real fix isn't a better taxonomy but abandoning the discrete taxonomy altogether. Work on discovering persistent user-interest 'journeys' shows people's actual goals are things like 'designing hydroponic systems for small spaces' (see "LLMs can discover and describe persistent user-interest journeys"), far too specific and personal to ever be a category in any hand-built list. And multi-facet identifier research (see "Can item identifiers balance uniqueness and semantic meaning?") makes the general point that no single discrete label can carry both distinctiveness and meaning at once; you need structured, generated representations. The pattern across all of these: discrete labels stop scaling long before reality does, and generation (of commands, of descriptions, of structured identifiers) is the corpus's recurring escape hatch.
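As a rough picture of what a "structured, generated representation" might look like in place of a single label, here is a small sketch. The class name, field names, rendering format, and example values are all hypothetical; only the idea of combining a numeric ID, a title, and attributes comes from the identifier note.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FacetedIdentifier:
    """Hypothetical multi-facet identifier: ID + title + attributes."""
    numeric_id: int     # carries distinctiveness
    title: str          # carries human-readable meaning
    attributes: tuple   # extra descriptive facets

    def render(self) -> str:
        attrs = ", ".join(self.attributes)
        return f"[{self.numeric_id}] {self.title} ({attrs})"

# Two requests that a flat label would collapse into one 'reschedule' box.
a = FacetedIdentifier(101, "Reschedule appointment", ("calendar", "existing booking"))
b = FacetedIdentifier(102, "Reschedule appointment", ("delivery", "existing order"))

print(a.render())
print(b.render())
```

The two identifiers share a title, so the title alone is ambiguous, and the bare numbers 101 and 102 say nothing about meaning; only the combination is both distinct and interpretable, which is the point the paragraph above attributes to multi-facet identifiers.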


Sources (5 notes)

Can command generation replace intent classification in dialogue systems?

Rasa's dialogue understanding architecture generates domain-specific commands instead of classifying intents, eliminating annotation requirements, handling context naturally, and scaling without degradation—treating understanding as pragmatics rather than semantics.

Where do retrieval systems fail and why?

RAG systems fail at three structural levels: adaptive triggering (fixed intervals waste context), semantic-task mismatch (embeddings measure association, not relevance), and mathematical limits (embedding dimension constrains representable document sets). These require fundamentally different retrieval approaches, not tuning.

Why does argument scheme classification stumble where other NLP tasks succeed?

Scheme classification requires recognizing inferential patterns across distributed text spans, not local surface features. Models plateau at F1 0.55–0.65 while the same systems exceed 0.80 on component tagging and stance, suggesting the integrative reasoning demand is fundamentally different.

Can item identifiers balance uniqueness and semantic meaning?

TransRec shows that combining numeric IDs, titles, and attributes into structured identifiers solves three problems simultaneously: distinctiveness from IDs, semantics from text, and generation grounding from structural constraints. Neither pure IDs nor pure text alone achieves all three.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a dialogue-systems researcher re-evaluating the scalability of intent taxonomies. The question remains open: what makes intent taxonomies unmanageable at hundreds of intents, and has that constraint been relaxed?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 across dialogue understanding, representation learning, and LLM reasoning:
• Intent classification forces continuous meaning into discrete mutually-exclusive boxes; the real problem is the primitive, not the count (Rasa reframing, ~2024).
• Embedding dimension creates a mathematical ceiling on the document sets a vector space can represent; mapped onto intents, this implies a structural limit on separability, not a tuning problem (~2025).
• Argument-scheme classification plateaus at F1 0.55–0.65 because it requires integrative pattern recognition, while component tagging and stance exceed 0.80; fine-grained intent disambiguation is argued to be the same kind of integrative task (~2024).
• Real user goals are persistent journeys (e.g., 'designing hydroponic systems for small spaces') — far too specific and personal for any hand-built taxonomy (~2023).
• Multi-facet identifiers (structured representations combining ID, title, and attributes) deliver distinctiveness, semantics, and generation grounding together, which neither pure IDs nor pure text achieves alone (TransRec, ~2023).

Anchor papers (verify; mind their dates):
• arXiv:2305.15498 (2023) — User Interest Journeys
• arXiv:2404.00750 (2024) — Argument Recognition (F1 plateaus)
• arXiv:2602.07338 (2026) — Intent Mismatch in Multi-Turn Conversation
• arXiv:2512.24601 (2025) — Recursive Language Models

Your task:
(1) RE-TEST: For each constraint above — discrete-label brittleness, embedding-dimension ceiling, integrative-task F1 plateau, and journey-discovery — judge whether recent advances in multi-turn context handling (e.g., 2026-02 Intent Mismatch), recursive reasoning (2025-12), or personalized memory (2025-07 PRIME) have RELAXED the tension between discrete taxonomies and real user goals. Separate the durable question (how to represent intent at scale) from the perishable limitation (whether discrete taxonomies are the only approach). Cite what resolved it, if anything.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: has anyone published a scalable discrete-intent system, or does the corpus still point toward command generation and structured representations as the escape hatch?
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., 'Can personalized cognitive memory (PRIME) enable dynamic, user-specific intent taxonomies?' or 'Do recursive language models dissolve the intent-recognition task into hierarchical command generation?'

Cite arXiv IDs; flag anything you cannot ground in a real paper.

Next inquiring lines