Can semantic clustering of stakeholders preserve meaningful evaluative diversity without manual curation?
This explores whether you can automatically group stakeholders by meaning (semantic clustering) to build evaluation panels that keep genuinely different viewpoints — without a human hand-picking who's in the room.
The corpus's most direct answer is encouraging but qualified. MAJ-EVAL automatically pulls stakeholder personas out of domain documents using semantic clustering, then stages a three-phase debate among them, and the result transfers across tasks like summarization and dialogue without anyone redesigning the panel by hand (see "Can personas extracted from documents generalize across evaluation tasks?"). So the answer to the literal question is 'yes, mechanically': you can skip manual curation and still ground personas in real perspectives rather than arbitrary roles.
But whether the diversity it preserves is *meaningful* is exactly where the corpus pushes back. One note ran the comparison head-to-head and found that clustering raw stakeholder text is the weaker move: k-means on what people *say* produces blurrier, less internally coherent groups than clustering on LLM-extracted latent traits like expertise and learning style, which capture who people are, not just their surface vocabulary (see "Can LLMs extract audience traits better than comment similarity?"). That's a warning shot for any pure-semantic-clustering approach: similarity in wording isn't the same as a real evaluative axis, so you can end up with personas that look distinct but evaluate identically.
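A toy sketch of that contrast, under loud assumptions: the stakeholder records, the trait labels, and the word-overlap grouping are all invented for illustration. The greedy Jaccard grouping is only a stand-in for k-means over text embeddings, and the traits are hand-written where the note's setup would have an LLM extract them.

```python
from collections import defaultdict

# Hypothetical toy data. In the note's setup an LLM would extract the
# traits; here they are written by hand purely for illustration.
STAKEHOLDERS = [
    {"id": "a", "comment": "the summary misses key dosage details",
     "expertise": "expert", "style": "detail"},
    {"id": "b", "comment": "the summary misses key details for me",
     "expertise": "novice", "style": "overview"},
    {"id": "c", "comment": "tone feels cold and rushed",
     "expertise": "novice", "style": "overview"},
    {"id": "d", "comment": "clinical accuracy of contraindications is weak",
     "expertise": "expert", "style": "detail"},
]

def jaccard(text_a, text_b):
    """Word-set overlap between two comments (0.0 to 1.0)."""
    words_a, words_b = set(text_a.split()), set(text_b.split())
    return len(words_a & words_b) / len(words_a | words_b)

def cluster_by_text(items, threshold=0.4):
    """Greedy single-pass grouping on surface wording.

    Assumption: a simplified stand-in for k-means on comment embeddings.
    Each item joins the first cluster whose first member it resembles.
    """
    clusters = []
    for item in items:
        for cluster in clusters:
            if jaccard(item["comment"], cluster[0]["comment"]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return [sorted(member["id"] for member in cluster) for cluster in clusters]

def cluster_by_traits(items):
    """Group on who people are: the (expertise, style) pair."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["expertise"], item["style"])].append(item["id"])
    return sorted(sorted(ids) for ids in groups.values())

text_groups = cluster_by_text(STAKEHOLDERS)
trait_groups = cluster_by_traits(STAKEHOLDERS)
```

In this contrived data, wording puts the expert "a" and the novice "b" in one group because their comments share most of their words, while the trait grouping separates them and pairs "a" with the other expert. The data was built to show the effect, so it illustrates the mechanism rather than providing evidence for it.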
The deeper catch is that diversity alone isn't the goal: diversity *plus competence* is. One study of multi-agent ideation found that cognitively diverse teams only beat a single strong agent when the members actually have senior domain knowledge; diverse-but-shallow teams underperform, because stimulation without expertise creates process losses instead of insight (see "Does cognitive diversity alone improve multi-agent ideation quality?"). Translate that to evaluation: clustering that maximizes how *different* your stakeholders sound, while ignoring whether each cluster carries real evaluative grounding, can manufacture noise that reads as diversity.
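One way to picture "diversity plus competence" in panel assembly is to keep one persona per cluster but only from candidates that clear a competence bar. This is a minimal sketch, not a method from the corpus: the candidate names, the `competence` scores, and the 0.7 threshold are all hypothetical.

```python
# Hypothetical candidates: `cluster` is the semantic cluster a persona fell
# into; `competence` is an assumed 0-1 score from some separate check.
CANDIDATES = [
    {"name": "clinician", "cluster": 0, "competence": 0.9},
    {"name": "med_student", "cluster": 0, "competence": 0.6},
    {"name": "caregiver", "cluster": 1, "competence": 0.8},
    {"name": "casual_reader", "cluster": 2, "competence": 0.3},
]

def select_panel(candidates, min_competence=0.7):
    """Keep the most competent member of each cluster.

    Clusters where nobody reaches `min_competence` contribute no panelist,
    so the panel trades some apparent diversity for evaluative grounding.
    """
    best = {}
    for cand in candidates:
        if cand["competence"] < min_competence:
            continue  # diverse-but-shallow candidates are skipped
        current = best.get(cand["cluster"])
        if current is None or cand["competence"] > current["competence"]:
            best[cand["cluster"]] = cand
    return [best[key]["name"] for key in sorted(best)]

panel = select_panel(CANDIDATES)
ungated_panel = select_panel(CANDIDATES, min_competence=0.0)
```

With the gate on, cluster 2 drops out entirely because its only member is shallow. With the gate off, the panel looks more diverse but includes exactly the kind of member the ideation study warns about.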
It's worth noticing the same 'cluster, then route' pattern recurs elsewhere in the corpus and works well: routing each query to a specialized model by semantic cluster beats a single frontier model (see "Can routing beat building one better model?"), and versioned capability vectors let agents discover each other by semantic match instead of manual wiring (see "Can semantic capability vectors replace manual agent routing?"). In those settings, semantic grouping has demonstrably retired hand-curation. The open question the corpus leaves is whether *evaluative* diversity (distinct judgments, not just distinct topics) survives the same automation. The safer designs hedge: they ground judges in evidence rather than vibes (see "Can agents evaluate AI outputs more reliably than language models?"), or decompose evaluation into structured stages (see "Can structured pipelines make LLM novelty assessment reliable?"), so a panel's diversity has something concrete to disagree over.
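The 'cluster, then route' pattern can be sketched as nearest-centroid routing. Everything named here is an assumption for illustration: the two-dimensional vectors, the cluster names, and the specialist labels are invented, and a real system would use learned embeddings and centroids rather than hand-picked ones.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space.
CENTROIDS = {"math": (1.0, 0.1), "code": (0.1, 1.0)}

# Hypothetical specialist assigned to each cluster.
SPECIALISTS = {"math": "model-math", "code": "model-code"}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def route(query_vec):
    """Send a query to the specialist whose cluster centroid is closest."""
    cluster = max(CENTROIDS, key=lambda name: cosine(query_vec, CENTROIDS[name]))
    return SPECIALISTS[cluster]

math_target = route((0.9, 0.2))
code_target = route((0.2, 0.8))
```

Routing of this kind separates queries by topic. Whether the same grouping separates stakeholders by judgment is the part that remains open.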
The thing you didn't know you wanted to know: clustering on what stakeholders *say* and clustering on who they *are* give you different panels, and on the corpus's evidence the second kind is the more likely to preserve the disagreement that makes a diverse evaluation worth running.
Sources (7 notes)
MAJ-EVAL automatically extracts stakeholder personas from domain documents via semantic clustering and orchestrates structured three-phase debate, achieving reproducible evaluation that transfers across tasks like summarization and dialogue without manual redesign. The approach grounds personas in real stakeholder perspectives rather than arbitrary roles.
LLM-extracted latent characteristics like expertise and learning style produce more internally homogeneous audience clusters than k-means on comment text alone. Trait extraction captures who people are, not just what they say.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.
Versioned Capability Vectors embedded in HNSW indices couple semantic matching with policy and budget constraints, making capability discovery a first-class operation that scales sub-linearly as agent heterogeneity increases.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.
A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.