INQUIRING LINE

Why does low temperature sampling extract consensus from diverse training data?

This explores why turning the temperature dial down doesn't just make outputs repetitive — it pulls the model toward a kind of majority vote baked into data that came from many disagreeing sources.


Low-temperature sampling acts like a consensus extractor rather than just a repeatability switch. The cleanest answer in the corpus comes from work on models trained across many imperfect experts: when a generative model learns from a crowd of teachers who each carry their own biases, cross-entropy optimization pushes it toward the *center of mass* of their behavior rather than any one teacher's quirks. Low temperature is what makes that center visible. By concentrating sampling on the highest-probability path, you read off the implicit majority vote, and because individual experts' errors are uncorrelated, that vote denoises them and can outperform every single expert it learned from (see: Can models trained on many imperfect experts outperform each one?). So the consensus isn't created by low temperature; it's *surfaced* by it. The diversity of the training data is exactly what makes the averaged signal trustworthy.
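A toy sketch can make the denoising argument concrete. Everything in it is hypothetical: three states, three deterministic experts who each err in a different state, and a model assumed to have learned the plain average of the experts' choices (which is what cross-entropy fitting gives when every expert contributes equal data). It is an illustration of the idea, not the setup of the cited work.

```python
# Toy illustration only: the states, actions, and experts below are hypothetical.
CORRECT = [0, 1, 2]  # correct action index for each of three states

# EXPERT_ACTIONS[e][s] is the action expert e always plays in state s.
# Each expert is wrong in exactly one state, and no two share a mistake.
EXPERT_ACTIONS = [
    [1, 1, 2],  # expert 0 errs in state 0
    [0, 2, 2],  # expert 1 errs in state 1
    [0, 1, 0],  # expert 2 errs in state 2
]
N_ACTIONS = 3


def mixture_policy(expert_actions, n_actions):
    """Average the experts' choices per state (assumes equal data per expert)."""
    n_states = len(expert_actions[0])
    policy = []
    for s in range(n_states):
        counts = [0.0] * n_actions
        for actions in expert_actions:
            counts[actions[s]] += 1.0
        policy.append([c / len(expert_actions) for c in counts])
    return policy


def apply_temperature(probs, temperature):
    """Rescale probabilities as p_i ** (1 / T), then renormalize.

    This matches dividing logits by T before a softmax.
    """
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [x / total for x in scaled]


def expected_accuracy(policy, correct, temperature):
    """Average probability of sampling the correct action across states."""
    hits = [apply_temperature(p, temperature)[c] for p, c in zip(policy, correct)]
    return sum(hits) / len(hits)


def expert_accuracy(actions, correct):
    """Fraction of states in which one expert plays the correct action."""
    return sum(a == c for a, c in zip(actions, correct)) / len(correct)


policy = mixture_policy(EXPERT_ACTIONS, N_ACTIONS)
best_expert = max(expert_accuracy(a, CORRECT) for a in EXPERT_ACTIONS)
acc_t1 = expected_accuracy(policy, CORRECT, 1.0)
acc_low = expected_accuracy(policy, CORRECT, 0.1)
print(best_expert, acc_t1, acc_low)
```

In this toy, every expert and the model sampled at temperature 1.0 are right about two thirds of the time, while the same model sampled at temperature 0.1 is right more than 99% of the time. The gain depends entirely on the errors not overlapping; if two experts shared a mistake, the mode in that state would be wrong and low temperature would lock it in.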

The same logic shows up from the opposite direction in test-time self-improvement. Models can bootstrap on unlabeled data by sampling many answers and rewarding whichever the crowd agrees on, and this works precisely because consensus answers tend to be correct (see: Can models improve themselves using only majority voting?). That's the temperature story in reverse: high-temperature sampling spreads you across the distribution so you can *find* the consensus by voting, where low-temperature sampling collapses you straight onto it. Both lean on the same assumption: that the mode of a distribution learned from diverse sources carries denoised signal.
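The voting route can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the reward scheme of the cited work: `ANSWER_PROBS` is a made-up answer distribution standing in for a model, and `vote_rewards` is a hypothetical helper that gives reward 1.0 to samples matching the majority answer.

```python
import random
from collections import Counter

# Hypothetical answer distribution for one question (stands in for a model).
ANSWER_PROBS = {"42": 0.5, "41": 0.3, "24": 0.2}


def sample_answers(answer_probs, n_samples, rng):
    """Draw answers at temperature 1.0, i.e. directly from the distribution."""
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    return rng.choices(answers, weights=weights, k=n_samples)


def majority_vote(samples):
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(samples).most_common(1)[0][0]


def vote_rewards(samples):
    """Pseudo-reward: 1.0 for samples that agree with the majority, else 0.0."""
    winner = majority_vote(samples)
    return [1.0 if s == winner else 0.0 for s in samples]


def greedy_answer(answer_probs):
    """Low-temperature limit: jump straight to the single most likely answer."""
    return max(answer_probs, key=answer_probs.get)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = sample_answers(ANSWER_PROBS, 1000, rng)
voted = majority_vote(samples)
rewards = vote_rewards(samples)
print(voted, greedy_answer(ANSWER_PROBS), sum(rewards) / len(rewards))
```

With enough samples, the voted answer and the greedy answer land on the same mode, which is the point of the paragraph above: both are readouts of the same peak, one reached by counting and the other by sharpening.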

But consensus has a cost, and the corpus is sharp about it. Pulling toward the agreement point means suppressing everything else. RL post-training does this aggressively, amplifying one dominant format from pretraining within the first epoch while collapsing the alternatives, and the format that 'wins' depends on model scale, not necessarily on being better (see: Does RL training collapse format diversity in pretrained models?). Whether that collapse helps or hurts turns out to be domain-dependent: convergence toward a single answer is a feature when code generation rewards correctness, but a bug when creative writing rewards distinctiveness (see: Does preference tuning always reduce diversity the same way?). The consensus low temperature extracts is only as good as the thing the domain actually wants.

There's also a trap worth knowing about. Consistency is not the same as reliability. Zero temperature and a fixed seed will reproduce the *same* output every time, but that output is still a single draw from the distribution, and repeating it 100 times tells you nothing about whether it was a good draw (see: Does setting temperature to zero actually make LLM outputs reliable?). So the consensus you read off the mode is meaningful only when the underlying distribution genuinely encodes denoised agreement (many diverse experts, votable correct answers). When it doesn't, low temperature just gives you a confidently repeated guess. And consensus mechanisms can fail outright at the system level: when you make LLM *agents* negotiate agreement explicitly, they tend to stall out rather than converge, with agreement degrading as the group grows (see: Can LLM agent groups reliably reach consensus together?). That is a reminder that the implicit statistical consensus inside one model is a very different, and more robust, thing than consensus assembled across many agents.
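The gap between consistency and reliability is easy to see in a toy case. The sketch below assumes a hypothetical answer distribution whose mode happens to be wrong; `consistency` and `accuracy` are illustrative helpers, not the McDonald's omega analysis used in the cited note.

```python
from collections import Counter

# Hypothetical distribution where the most likely answer is not the true one.
ANSWER_PROBS = {"A": 0.45, "B": 0.35, "C": 0.20}
TRUTH = "B"


def greedy_answer(answer_probs):
    """Zero-temperature decoding: always return the single most likely answer."""
    return max(answer_probs, key=answer_probs.get)


def repeat_greedy(answer_probs, n_runs):
    """Run the deterministic decoder n_runs times."""
    return [greedy_answer(answer_probs) for _ in range(n_runs)]


def consistency(outputs):
    """Fraction of runs that match the most common output."""
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)


def accuracy(outputs, truth):
    """Fraction of runs that match the true answer."""
    return sum(o == truth for o in outputs) / len(outputs)


runs = repeat_greedy(ANSWER_PROBS, 100)
print(consistency(runs), accuracy(runs, TRUTH))
```

Here the 100 repetitions agree perfectly and are wrong every time. Agreement across reruns measures how deterministic the decoder is, not whether the mode it returns deserves trust.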

The thing you didn't know you wanted to know: low temperature isn't a reliability knob, it's a *readout* knob. It exposes whatever consensus the training distribution already contains — denoised wisdom when the data is diverse and the errors cancel, or a brittle single guess when they don't. The interesting question is never 'should I lower temperature' but 'does my distribution actually have a consensus worth extracting.'


Sources (6 notes)

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Can LLM agent groups reliably reach consensus together?

Across hundreds of simulations, LLM-agent groups frequently fail to reach valid agreement due to timeouts and stalled convergence rather than subtle value corruption. Agreement degrades with group size even without Byzantine agents present.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains: Why does low-temperature sampling extract consensus from diverse training data, and when does that consensus actually encode denoised signal vs. a brittle single guess?

What a curated library found — and when (dated claims, not current truth):
The library spans 2024–2026. Key findings:
• Low temperature surfaces implicit majority-vote consensus learned from diverse experts; individual errors cancel, outperforming single teachers (~2024-06, arXiv:2406.11741).
• High-temperature sampling lets models bootstrap consensus via voting on unlabeled data; consensus answers tend to be correct (~2024-10, arXiv:2410.08020).
• RL post-training aggressively amplifies one dominant pretraining format within a single epoch, collapsing alternatives; the 'winner' depends on scale, not quality (~2025-04, arXiv:2504.07912).
• Consensus extraction is domain-dependent: it's a feature for code (rewards correctness) but a bug for creative writing (rewards distinctiveness) (~2025-04, arXiv:2504.12522).
• LLM agents explicitly negotiating consensus stall and degrade as group size grows; implicit statistical consensus within one model is far more robust (~2026-03, arXiv:2603.01213).
• Deterministic settings (zero temp + fixed seed) reproduce outputs but don't validate reliability; a repeated guess is still one draw (~2025-08, arXiv:2508.15260).

Anchor papers (verify; mind their dates):
• arXiv:2406.11741 (2024-06) — Transcendence: Generative Models Can Outperform The Experts That Train Them
• arXiv:2504.07912 (2025-04) — Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
• arXiv:2504.12522 (2025-04) — Evaluating the Diversity and Quality of LLM Generated Content
• arXiv:2603.01213 (2026-03) — Can AI Agents Agree?

Your task:
(1) RE-TEST EACH CONSTRAINT. For the six findings above, judge whether newer models, methods (scaling, ensemble inference, mixture-of-experts), training (multi-teacher, curriculum), tooling (adaptive temperature schedules), or evaluation (post-hoc consensus validation) have since RELAXED or OVERTURNED them. Separate the durable question (when does consensus extraction guarantee denoising?) from the perishable limitation (does RL collapse kill diversity on current-scale models?). Cite what moved the constraint, and say plainly where it still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months on whether consensus is a proxy for correctness, or on methods that bypass low-temp readout altogether (e.g., confidence-aware sampling, hierarchical voting, self-supervised consensus validation).
(3) Propose two research questions that ASSUME the regime may have moved: (a) Can you certify *when* a distribution's consensus is denoised vs. spurious, without retraining? (b) Do newer architectures (state-space, recurrent depth, MoE) encode consensus differently, or is the low-temperature readout universal?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
