What conditions make training diversity better than individual expert quality?
This note explores when a model (or team) trained on many varied-but-imperfect sources beats one trained narrowly on a single high-quality expert. The corpus suggests the answer hinges less on raw quality and more on what happens *after* training: whether outputs get combined, searched over, or pushed out of distribution.
The clearest case for diversity is when individual errors are uncorrelated. A model trained on many imperfect experts with different biases can converge on a consensus that denoises their independent mistakes: an implicit majority vote that outperforms any single expert on the decision points that matter (see "Can models trained on many imperfect experts outperform each one?"). The key word is *uncorrelated*: diversity wins because the experts are wrong in different directions, so averaging cancels the noise. A single high-quality expert has no such cancellation; its blind spots are load-bearing.
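The cancellation argument can be made concrete with a small calculation. The sketch below is not the mechanism the cited note describes (which is an implicit vote learned through training); it assumes an explicit majority vote on a binary decision, with every expert equally accurate and errors either fully independent or fully shared. The function names and the accuracy values are hypothetical and chosen only for illustration.

```python
from math import comb


def majority_vote_accuracy(n_experts: int, p_correct: float) -> float:
    """Probability that a strict majority of independent experts is right.

    Assumes a binary decision and that each expert is right with the same
    probability, independently of the others (an illustrative simplification).
    """
    if n_experts < 1 or n_experts % 2 == 0:
        raise ValueError("n_experts must be a positive odd number (no ties)")
    smallest_majority = n_experts // 2 + 1
    # Sum the binomial probabilities of every outcome where the majority is right.
    return sum(
        comb(n_experts, k) * p_correct**k * (1 - p_correct) ** (n_experts - k)
        for k in range(smallest_majority, n_experts + 1)
    )


def fully_correlated_accuracy(n_experts: int, p_correct: float) -> float:
    """If all experts share the same mistakes, voting has nothing to cancel."""
    return p_correct


if __name__ == "__main__":
    strong_single_expert = 0.85  # hypothetical accuracy of one strong expert
    for n in (1, 5, 15, 45):
        independent = majority_vote_accuracy(n, 0.70)
        correlated = fully_correlated_accuracy(n, 0.70)
        print(n, round(independent, 3), correlated, strong_single_expert)
```

Under these assumed numbers, the independent vote of fifteen 70%-accurate experts exceeds the hypothetical 85% expert, while the fully correlated group stays at 70% no matter how many experts are added. Real expert errors sit somewhere between the two extremes, so the gain depends on how much of the error is actually shared.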
Diversity also wins whenever inference does more than emit one answer. When a model feeds into search (evolutionary methods, repeated sampling, recombination), training for varied competent solutions beats optimizing for a single best one, because a collapsed policy cannot reach problems that require exploring multiple modes (see "Should training maximize diversity when models feed into search?"). Relatedly, diversity is what enables generalization *out* of distribution: quality drives in-distribution accuracy, but diversity is the ingredient that lets a model handle problems it wasn't trained on (see "How do quality, diversity, and complexity affect synthetic data differently?"). This is why richer, more confident teacher signals can backfire: they produce concise, certain student traces that ace in-domain tests but lose the epistemic caution needed off-distribution (see "Does richer teacher context hurt student generalization?").
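The search argument can be illustrated with a toy coverage calculation for repeated sampling. This is a sketch under strong assumptions: samples are independent, each problem has a fixed per-sample success rate, and the two policies and their rates are invented for illustration. It is not an implementation of any method named in the sources.

```python
from typing import Sequence


def coverage_at_k(per_problem_success: Sequence[float], k: int) -> float:
    """Fraction of problems solved by at least one of k independent samples."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not per_problem_success:
        raise ValueError("need at least one problem")
    # A problem stays unsolved only if all k samples fail: (1 - p) ** k.
    solved = sum(1.0 - (1.0 - p) ** k for p in per_problem_success)
    return solved / len(per_problem_success)


# Hypothetical per-sample success rates on five problems.
# The collapsed policy always emits the same answer: certain or hopeless.
collapsed_policy = [1.0, 1.0, 1.0, 0.0, 0.0]
# The diverse policy is less reliable per sample but never stuck at zero.
diverse_policy = [0.5, 0.5, 0.5, 0.2, 0.2]

if __name__ == "__main__":
    for k in (1, 4, 16):
        print(
            k,
            round(coverage_at_k(collapsed_policy, k), 3),
            round(coverage_at_k(diverse_policy, k), 3),
        )
```

In this toy setup the collapsed policy has the better single-sample score, but its coverage does not move as `k` grows, because extra samples of the same answer add nothing. The diverse policy starts lower and overtakes it once the sampling budget is large enough.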
But diversity is not free, and the corpus names a sharp boundary: it only pays off on a foundation of genuine competence. Multi-agent teams beat solo agents on ideation *only* when members hold real senior domain expertise. Diverse teams without it underperform even a single competent agent, because cognitive stimulation without grounding produces process losses instead of insight (see "Does cognitive diversity alone improve multi-agent ideation quality?"). The same compatibility logic appears in distillation: objectively higher-quality teacher refinements *degrade* a student when they exceed its learning frontier, so the student should filter for what it can actually absorb (see "Does teacher-refined data always improve student model performance?"). Quality that the learner can't use is worse than diversity it can.
The deeper reason this question matters is that standard training actively destroys diversity, and the loss is often invisible. Outcome-based RL sharpens the policy globally, concentrating probability on correct answers for solved problems while bleeding diversity away from the unsolved ones (see "Does outcome-based RL diversity loss spread across unsolved problems?"), and RL more broadly amplifies one dominant format while suppressing every alternative within the first epoch (see "Does RL training collapse format diversity in pretrained models?"). Counteracting that collapse takes deliberate mechanism: step-level critique in the loop to hold open the solution space (see "Do critique models improve diversity during training itself?"), role specialization so co-trained agents don't homogenize (see "Can multiple agents stay diverse during training together?"), or rewarding semantic diversity directly, which, surprisingly, can raise quality rather than trade against it (see "Can diversity optimization improve quality during language model training?"). And whether preference tuning even reduces diversity is domain-dependent: it narrows code, where the domain rewards convergence toward correct solutions, but it can widen creative writing, where distinctiveness *is* the quality (see "Does preference tuning always reduce diversity the same way?"). So the condition isn't 'diversity good, quality bad'. Diversity beats expert quality exactly when errors are independent, when something downstream combines or searches the outputs, when the task runs off-distribution, and when there's enough baseline competence for variety to compound instead of just adding noise.
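Because the diversity loss described above is easy to miss, one simple way to make it visible is to track the entropy of some categorical property of sampled outputs across checkpoints. The sketch below assumes each sampled output has already been assigned a format label by some external step; the labels and the helper name are hypothetical, and the sources do not prescribe this particular diagnostic.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def shannon_entropy_bits(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the empirical label distribution."""
    if not labels:
        raise ValueError("need at least one label")
    total = len(labels)
    counts = Counter(labels)
    # Each term is p * log2(1 / p), with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # Hypothetical format labels for six samples before and after RL training.
    before_rl = ["prose", "code", "latex", "prose", "code", "latex"]
    after_rl = ["code"] * 6
    print(round(shannon_entropy_bits(before_rl), 3))
    print(round(shannon_entropy_bits(after_rl), 3))
```

An entropy that falls toward zero while accuracy holds steady or rises is the signature of the sharpening described above: the policy is becoming more certain, not more varied.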
Sources (12 notes)
Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.
Vector Policy Optimization trains models to emit varied competent solutions rather than converging to one answer. This unlocks search procedures like evolutionary algorithms to explore and combine modes, solving problems that entropy-collapsed policies cannot reach at all.
Quality drives in-distribution generalization, diversity enables out-of-distribution generalization, and complexity strengthens both. Current evaluation methods collapse these into a single quality metric, causing self-improvement loops to degrade through irreversible diversity loss.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Multi-agent teams substantially outperform solo ideation, but only when members possess genuine senior knowledge. Diverse teams without expertise underperform even a single competent agent, because cognitive stimulation without expertise triggers process losses instead of insight.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Training generation and critic agents on distinct role-dependent data prevents the overfitting collapse that limits single-agent finetuning to one productive iteration. Removing critics or summarization degrades performance, confirming both components are critical.
DARLING jointly optimizes for quality and semantic diversity using a learned classifier, finding that diversity rewards catalyze exploration and produce higher-quality outputs than quality-only baselines across both creative and mathematical tasks.
RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.