SYNTHESIS NOTE

Can models trained on many imperfect experts outperform each one?

Do generative models trained on diverse, imperfect human experts develop an implicit consensus that surpasses any individual contributor? This note explores whether aggregating diverse perspectives at training time, rather than at inference time, can denoise human biases.

Synthesis note · 2026-02-22 · sourced from Training Fine Tuning

The Transcendence paper formalizes a surprising property: generative models trained on many experts with diverse capacities and biases can outperform any single expert. The mechanism is implicit majority voting. When a model is trained on games from diverse human chess players, cross-entropy optimization pulls its predictions toward the consensus behavior, which, by the wisdom-of-the-crowd effect, is often better than any individual contributor's.

Low-temperature sampling is the key enabler. At low temperature, the model's output distribution concentrates on its highest-probability predictions, which represent the consensus. As the temperature approaches zero, this is formally equivalent to a majority vote. The advantage comes primarily from much better performance on a small subset of states, likely the critical, outcome-determining positions where individual human biases diverge most and crowd wisdom is most valuable.
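The concentration effect described above can be sketched in a few lines. This is only an illustration under stated assumptions: the three expert distributions are made-up numbers (not taken from the paper), the trained model is assumed to match the equal-weight average of the experts, and `apply_temperature` is a hypothetical helper name.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a distribution as p_i ** (1 / T), then renormalize."""
    # Work in log space for stability; zero-probability entries stay at zero.
    logs = [math.log(p) / temperature if p > 0 else float("-inf") for p in probs]
    peak = max(logs)
    weights = [math.exp(l - peak) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical position with three candidate moves. Each row is one
# expert's move distribution (illustrative numbers only).
experts = [
    [0.6, 0.3, 0.1],  # favours move 0
    [0.5, 0.1, 0.4],  # favours move 0
    [0.2, 0.7, 0.1],  # favours move 1
]

# Assumption: a model fit to the pooled data with cross-entropy approaches
# the average of the expert distributions (equal data per expert).
n_moves = len(experts[0])
mixture = [sum(e[m] for e in experts) / len(experts) for m in range(n_moves)]

# Lower temperatures push more probability mass onto the top move.
for t in (1.0, 0.5, 0.1, 0.01):
    print(t, [round(p, 3) for p in apply_temperature(mixture, t)])
```

In this toy case two of the three experts prefer move 0, and move 0 is also the mode of the averaged distribution, so near-zero temperature returns the move a plurality vote would pick. The averaged distribution is a soft vote, so the two need not coincide for every choice of numbers.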

Diversity in the training data is a necessary condition. Without diversity, there is no denoising — a model trained on clones of one expert can only approach that expert's level. The practical conditions for transcendence: (1) diverse training sources with different biases, (2) a task where individual biases are uncorrelated (so they cancel under aggregation), and (3) low-temperature decoding to extract the consensus.
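The role of diversity can be illustrated with a small simulation. This is a toy model, not the paper's setup: every expert is assumed to pick the correct move (move 0) with the same probability and otherwise a uniformly random wrong move, and "clones" are modelled as experts that all copy a single draw. The function names and parameters are hypothetical.

```python
import random
from collections import Counter

def plurality(votes):
    """Return the most common vote (ties go to the vote seen first)."""
    return Counter(votes).most_common(1)[0][0]

def expert_move(rng, accuracy, n_moves):
    """Move 0 is correct; otherwise pick a uniformly random wrong move."""
    if rng.random() < accuracy:
        return 0
    return rng.randrange(1, n_moves)

def consensus_accuracy(n_experts, accuracy, correlated,
                       n_moves=5, trials=20000, seed=0):
    """Fraction of trials in which the plurality vote is the correct move."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if correlated:
            # Clones: one draw, copied by every expert (identical biases).
            votes = [expert_move(rng, accuracy, n_moves)] * n_experts
        else:
            # Diverse: each expert errs independently of the others.
            votes = [expert_move(rng, accuracy, n_moves) for _ in range(n_experts)]
        hits += plurality(votes) == 0
    return hits / trials

diverse = consensus_accuracy(9, 0.6, correlated=False)
clones = consensus_accuracy(9, 0.6, correlated=True)
print(f"diverse: {diverse:.3f}  clones: {clones:.3f}")
```

Under these assumptions the consensus of independent experts lands well above the individual accuracy of 0.6, while the consensus of clones stays at roughly 0.6, matching the claim that aggregation only denoises when the errors are uncorrelated.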

This connects to, but is distinct from, the note "Why does majority voting outperform more complex inference methods?" That note describes inference-time majority voting over multiple samples from one model. Transcendence describes training-time majority voting, implicitly encoded in a single model's weights through diverse training data. The mechanism is analogous (aggregation denoises), but it operates at a different stage: training rather than inference.

The implication for LLM training is provocative: the "average" of many imperfect human demonstrations may be better than any individual human demonstration, provided the imperfections are diverse rather than correlated. This challenges the assumption that training data quality should be maximized per-example; quantity and diversity of perspectives may matter as much as individual quality.

Inquiring lines that read this note 23

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do language models inherit human biases from training data?

How does same-author bias interact with the four adversarial judge biases already documented?

How do multi-agent systems achieve genuine cooperation and reasoning?

Why does diversity without expertise produce worse results than a single capable agent?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Why do readers trust citations and complexity regardless of accuracy?

How do experts select which other experts to trust?

How does test-time aggregation affect reasoning correctness and reliability?

When does optimizing for quality undermine the value of diversity?

Why do persona-level simulations fail to predict individual preferences accurately?

Can individually accurate agents still fail at population-level representation?

How can we distinguish genuine user preferences from measurement artifacts?

What information is lost when majority labels discard minority interpretations?

What are the consequences of models training on synthetic data?

How do social dynamics and selection effects compound in rating aggregates?

How much noise comes from rater idiosyncrasy versus selection bias?

What articulatory information do speech signals carry that text cannot?

How does mixture of experts enable flexible capacity sharing between modalities?

Can ensemble evaluation methods reduce bias more than single judges?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

Can untrained aggregators waste the benefits of parallel sampling?

Related concepts in this collection 4

Why does majority voting outperform more complex inference methods? Simple majority voting across independent samples often matches or beats sophisticated alternatives like Best-of-N and sequential revision. What makes this basic approach so hard to beat for reasoning models?
inference-time voting analog; this is the training-time version

Does voting discard useful reasoning from losing chains? When multiple reasoning chains compete through majority voting, intermediate steps from non-winning chains are discarded. Could extracting and mixing those intermediate facts improve both the final answer and our ability to understand the reasoning?
shows limits of pure voting; transcendence may have similar limits

Does training on AI-generated content permanently degrade model quality? When generative models train on outputs from previous models, do the resulting models lose rare patterns permanently? The question matters because future training data will inevitably contain synthetic content.
counterpoint: while diversity enables transcendence, synthetic data collapses diversity

Can generative and discriminative models reach agreement? Generative and discriminative decoding often produce conflicting answers. Can a game-theoretic framework force these two complementary procedures to reconcile their predictions into a single, more reliable output?
related consensus mechanism: transcendence achieves consensus across diverse training experts at training time, while Consensus Game achieves consensus between generative and discriminative decoding modes at inference time; both extract a signal more reliable than any single perspective
