Does majority voting reliably signal correctness without risking reward hacking?
This explores whether using model consensus (many samples agreeing) as a stand-in for 'correct' is trustworthy — and where that shortcut quietly breaks or gets gamed.
Majority voting means sampling an answer many times and trusting the consensus. The question here is whether that consensus is a reliable signal of correctness, and whether using it as a training reward invites the model to game it. The corpus gives a two-sided answer: majority voting is a remarkably strong baseline, but its reliability is conditional, and once you turn 'consensus' into a reward you inherit a specific failure mode rather than escaping reward hacking entirely.
On the strong side: majority voting is hard to beat. Across benchmarks it matches or outperforms more elaborate inference methods like Best-of-N and sequential revision, precisely because it sidesteps the unreliable verifiers and shaky self-assessment those methods lean on (see: Why does majority voting outperform more complex inference methods?). That robustness even extends to training: models can self-improve on unlabeled data using majority-vote rewards, bootstrapping because consensus answers tend to be correct (see: Can models improve themselves using only majority voting?). So as a correctness signal it often works, which is exactly what makes its blind spots dangerous.
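As a rough illustration of the majority-vote reward described above, the sketch below assumes the final answer has already been extracted from each sample as a comparable string. The function name `majority_vote_rewards` and the 0/1 reward scheme are illustrative assumptions, not the exact setup of any method in the sources.

```python
from collections import Counter


def majority_vote_rewards(answers):
    """Return (consensus, rewards) for a list of sampled final answers.

    Hypothetical helper: each sample earns reward 1.0 if it matches the
    most common answer and 0.0 otherwise. No ground-truth label is used.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns the most frequent answer; ties go to the
    # answer that was seen first.
    consensus, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == consensus else 0.0 for a in answers]
    return consensus, rewards


samples = ["42", "41", "42", "7", "42"]
consensus, rewards = majority_vote_rewards(samples)
print(consensus, rewards)
```

Nothing in this sketch checks the consensus against a ground truth: if most samples share the same mistake, that mistake is what gets rewarded.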
The critical catch is that consensus only tracks truth above a competence threshold. Test-time RL via majority vote helps only when the model's prior accuracy is already above roughly 50%; below that, the loop silently amplifies wrong answers: the majority confidently agrees on the wrong thing, and you train toward it (see: When does majority-vote reward actually help test-time learning?). This is the deeper version of 'reward hacking': not a clever model exploiting a loophole, but a reward that is structurally untethered from correctness in the wrong regime. The safe move is to gate: probe each prompt class to confirm you're in the favorable regime before trusting the vote. That mirrors a pattern seen elsewhere in the corpus: rubrics work better as accept/reject *gates* than as dense rewards, precisely because gating preserves a categorical correctness check instead of handing the model a smooth surface to climb and exploit (see: Can rubrics and dense rewards work together without hacking?).
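To see why a threshold near 50% is plausible, consider a simplified model (an assumption for illustration, not the analysis in the source): each sample is independently correct with probability p, and there are only two candidate answers, so the vote is correct exactly when more than half of n samples are correct. The sketch computes that probability and adds a hypothetical `should_trust_vote` gate that checks measured accuracy on a small labeled probe set for one prompt class.

```python
from math import comb


def majority_correct_prob(p, n):
    """P(majority of n independent samples is correct), for odd n.

    Simplifying assumptions: samples are independent, each is correct
    with probability p, and there are only two candidate answers.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    need = n // 2 + 1  # smallest number of correct samples that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


def should_trust_vote(probe_correct, probe_total, threshold=0.5):
    """Hypothetical gate: trust the vote only if probe accuracy exceeds threshold."""
    if probe_total <= 0:
        return False
    return probe_correct / probe_total > threshold


for p in (0.4, 0.5, 0.6):
    print(p, round(majority_correct_prob(p, 1), 3), round(majority_correct_prob(p, 101), 3))
```

Under these assumptions, more samples push the vote toward certainty in whichever direction single-sample accuracy already leans: above 0.5 the vote becomes more reliable, below 0.5 it becomes more reliably wrong. With more than two candidate answers or correlated samples the exact threshold can differ, which is one reason to measure it per prompt class rather than assume it.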
There's also a quieter cost: voting throws information away. Self-consistency picks the majority answer but discards the reasoning in every losing chain, and meta-reasoning over all chains beats plain voting on both accuracy and auditability (see: Does voting discard useful reasoning from losing chains?). And voting's whole premise, that parallel samples will converge on the right answer, collapses on genuinely sequential problems such as graph connectivity, where chain-of-thought holds an exponential advantage because the answer requires accumulating intermediate steps that short parallel chains can't reconstruct (see: When does sequential reasoning beat parallel voting?). Consensus measures agreement; it carries no information about whether the problem was one that parallel sampling could solve in the first place.
The thread worth carrying away: agreement is not truth, and consistency is not reliability. The corpus makes that point from a different angle when it shows that deterministic, zero-temperature outputs are perfectly consistent yet still just one draw from the model's distribution (see: Does setting temperature to zero actually make LLM outputs reliable?). Related cautions show how reward proxies drift from what you actually want: binary correctness rewards push models toward confident guessing because they never penalize confident wrong answers (see: Does binary reward training hurt model calibration?), and aggregate reward models structurally erase minority-but-correct positions (see: Can aggregate reward models satisfy genuinely disagreeing users?). Majority voting doesn't reward-hack the way an adversarial optimizer does, but treat consensus as a verifier rather than a competence-gated heuristic and it will reliably mislead you exactly where you can least afford it.
Sources (9 notes)
Across benchmarks, majority voting empirically outperforms or matches Best-of-N and sequential revision approaches. Its robustness stems from avoiding unreliable verifiers, poor self-assessment, and unnecessary complexity—making it the right baseline for evaluating reasoning model improvements.
Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
Test-time RL via consensus succeeds when prior accuracy exceeds ~50%, but below that threshold it silently amplifies wrong answers. Safe deployment requires gated probing per prompt class to confirm the favorable regime before training.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Standard self-consistency voting selects the majority answer but discards intermediate reasoning from non-winning chains. Multi-chain reasoning instead meta-reasons over all chains simultaneously to extract distributed information, improving both task accuracy and producing coherent, auditable explanations.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
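The calibration remedy mentioned in the source notes, adding the Brier score as a second reward term, can be sketched as follows. This is one plausible form, assuming the model reports a confidence q in [0, 1] alongside its answer; the function name `calibrated_reward` and the equal weighting of the two terms are assumptions, not details confirmed by the source.

```python
def calibrated_reward(is_correct, confidence):
    """Binary correctness reward minus a Brier penalty.

    Assumed form: r = y - (q - y)^2, where y is 1.0 for a correct answer
    and 0.0 otherwise, and q is the model's stated confidence in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - y) ** 2
    return y - brier_penalty


for q in (0.0, 0.5, 1.0):
    print(q, calibrated_reward(True, q), calibrated_reward(False, q))
```

Under a plain binary reward, a wrong answer scores 0 regardless of stated confidence. In this sketch a wrong answer given with full confidence scores -1.0, while the same wrong answer given with zero confidence scores 0.0, so confident guessing is no longer free.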