How do binary comparisons constrain reward scale in multi-user preference learning?
This explores a structural limit: binary preference comparisons throw away magnitude. A pairwise 'A is better than B' tells you the sign of a difference but not its size, so the learned reward has an arbitrary scale — and the corpus suggests that arbitrariness, harmless with one user, compounds badly once you're serving many.
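To make the "sign but not size" point concrete, here is a minimal Python sketch. The reward values and the rescaling are made up for illustration. It shows that a tiny gap (A vs. B) and a huge gap (A vs. C) produce the same kind of label, and that a completely different reward scale produces identical comparison data.

```python
# Sketch: binary preference labels are unchanged by a positive rescale and shift.
# The reward values and the transformation below are hypothetical illustrations.

def pairwise_labels(rewards):
    """Return {(a, b): True if a is preferred to b} for every ordered pair."""
    names = sorted(rewards)
    return {
        (a, b): rewards[a] > rewards[b]
        for a in names
        for b in names
        if a != b
    }

# Hypothetical latent rewards: A barely beats B, and both crush C.
original = {"A": 0.10, "B": 0.09, "C": -2.00}

# The same rewards on a very different scale: multiply by 50, shift by 7.
rescaled = {name: 50.0 * r + 7.0 for name, r in original.items()}

labels_original = pairwise_labels(original)
labels_rescaled = pairwise_labels(rescaled)

# "A beats B" and "A beats C" are both just True; the gap sizes are gone.
print(labels_original[("A", "B")], labels_original[("A", "C")])

# The comparison data cannot tell the two reward scales apart.
print(labels_original == labels_rescaled)
```

Any reward model fit only to these labels is free to pick either scale, which is the arbitrariness the rest of this note is about.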
The sharpest version of the problem is calibration. When rewards collapse to a binary correct/incorrect signal, models are pushed toward confident guessing because nothing penalizes a confident wrong answer: the reward can't distinguish 'barely right' from 'certainly right.' The note "Does binary reward training hurt model calibration?" shows this is provable, and that adding a proper scoring rule (the Brier score) as a second reward term restores the lost scale by forcing the model to commit to a magnitude, not just a direction. That's the single-user symptom of a scale-free reward.
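The effect of the second term can be checked numerically. The sketch below assumes a simple form for the combined reward, correctness minus the squared error between stated confidence and outcome. That form is an illustrative assumption, not code from the cited note. It compares expected reward across stated confidences `q` when the answer is actually right with probability `p_true`.

```python
# Sketch: expected reward as a function of stated confidence q, when the
# answer is actually correct with probability p. Both reward definitions
# are illustrative assumptions, not a specific paper's implementation.

def expected_binary_reward(p, q):
    """Correct/incorrect reward: the stated confidence q is ignored."""
    return p * 1.0 + (1.0 - p) * 0.0

def expected_brier_reward(p, q):
    """Correctness minus squared error between confidence and outcome."""
    reward_if_correct = 1.0 - (q - 1.0) ** 2
    reward_if_wrong = 0.0 - (q - 0.0) ** 2
    return p * reward_if_correct + (1.0 - p) * reward_if_wrong

p_true = 0.6
grid = [i / 100 for i in range(101)]  # candidate confidences 0.00 .. 1.00

# The binary reward takes one value across the whole grid: confidence is free.
binary_values = {expected_binary_reward(p_true, q) for q in grid}

# With the Brier term, the best stated confidence is the true probability.
best_q_brier = max(grid, key=lambda q: expected_brier_reward(p_true, q))

print(len(binary_values))
print(best_q_brier)
```

Under the binary reward every confidence scores the same, so claiming certainty costs nothing. With the Brier term the expected reward simplifies to `2*p*q - q**2`, which peaks at `q = p`, so honest confidence becomes the optimum.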
Now add multiple users, and the missing scale becomes a representational crisis. A single reward model trained on aggregated binary preferences cannot encode disagreement: a 51-49 split forces it to either always satisfy the majority, leaving 49% unhappy every time, or satisfy everyone only half the time, because there's no axis on which 'this group feels strongly, that group is indifferent' can be expressed (see "Can aggregate reward models satisfy genuinely disagreeing users?"). Binary comparisons flatten intensity into a vote, and votes have no scale. This is why the statistics of preference data don't behave like ordinary data: "Does preference data need more raters than examples?" shows the error bounds depend on the *number of raters*, not just the number of examples, because each rater carries a different latent scale you can't recover from comparisons alone.
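A small sketch of the 51-49 case follows. It assumes a Bradley-Terry model for the pooled votes, which is a common choice but an assumption here, and the group shares and the two responses are hypothetical.

```python
import math

# Sketch: what one aggregated reward model sees for a 51-49 split.
# Group shares and the two candidate responses (A, B) are hypothetical.

share_prefers_a = 0.51  # fraction of users who prefer response A
share_prefers_b = 0.49  # fraction of users who prefer response B

# Assumed Bradley-Terry model: P(A beats B) = sigmoid(r_A - r_B), so the
# maximum-likelihood reward gap for pooled votes is the logit of the win rate.
pooled_gap = math.log(share_prefers_a / share_prefers_b)

def satisfaction(prob_serve_a):
    """Per-group chance of receiving the response that group prefers."""
    return {
        "prefers_a": prob_serve_a,
        "prefers_b": 1.0 - prob_serve_a,
    }

always_majority = satisfaction(1.0)  # greedily maximize the pooled reward
coin_flip = satisfaction(0.5)        # serve each response half the time

print(f"pooled reward gap: {pooled_gap:.3f}")
print("always serve A:", always_majority)
print("serve A half the time:", coin_flip)
```

The pooled gap is a small positive number that depends only on the vote shares. It would be identical if the 49% cared intensely and the 51% barely at all, which is the intensity information the aggregated model has no place to store.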
The corpus also suggests the binary signal is noisier than it looks at the source. Annotation clicks aren't all the same thing: they decompose into genuine preferences, non-attitudes, and on-the-spot constructed preferences, and treating them uniformly contaminates the reward (see "Do all annotation responses measure the same underlying thing?"). A binary comparison can't tell a deeply held judgment from a coin-flip, which is exactly the scale information multi-user learning needs most.
The interesting move is the corpus's escape routes, which all work by *not* relying on absolute binary labels. POLAR reframes reward modeling as measuring distance from a target policy, eliminating absolute preference labels entirely and getting a continuous scale for free (see "Can reward models learn by comparing policies instead of judging them?"). Reward factorization keeps things personal by representing each user as a linear combination of shared base rewards, inferable from about ten adaptive questions, which recovers per-user scale without per-user retraining (see "Can user preferences be learned from just ten questions?"). And natural-language critiques carry the magnitude and the *why* that a numerical reward, let alone a binary one, structurally cannot, which is how they break plateaus where scalar rewards stall (see "Can natural language feedback overcome numerical reward plateaus?"). The throughline: binary comparisons constrain reward scale by design, and the fix is always to restore a richer signal: a second scoring term, a distance, a per-user basis, or words.
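Of these routes, reward factorization is the easiest to sketch in a few lines. The example below illustrates only the "linear combination of shared base rewards" idea. The base reward names, scores, and user weights are hypothetical, and it does not implement the adaptive question selection used to infer the weights.

```python
# Sketch: per-user reward as a weighted sum of shared base rewards.
# Base reward names, response scores, and user weights are all hypothetical.

BASE_REWARDS = ("helpfulness", "brevity")

# Shared base-reward scores for two candidate responses,
# listed in the order given by BASE_REWARDS.
responses = {
    "long_answer": (0.9, 0.2),
    "short_answer": (0.6, 0.9),
}

# Each user is just a small weight vector over the shared base rewards.
users = {
    "detail_lover": (1.0, 0.1),
    "skimmer": (0.3, 1.0),
}

def user_reward(weights, base_scores):
    """Linear combination of base-reward scores for one user."""
    return sum(w * s for w, s in zip(weights, base_scores))

def preferred_response(weights):
    """Response with the highest personalized reward for these weights."""
    return max(responses, key=lambda name: user_reward(weights, responses[name]))

choices = {user: preferred_response(w) for user, w in users.items()}
print(choices)
```

The two users disagree about the best response while sharing every base reward, so the disagreement lives in a handful of per-user coefficients, not in a separately trained model per user.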
Sources (7 notes)
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Preference data is not i.i.d. across raters with different preferences. PAC bounds for personalized reward models decompose into terms depending on both examples per rater and number of raters, showing rater diversity matters as much as data volume.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce uncertainty about each user's coefficients. Outputs can then be personalized to each user via inference-time reward alignment, without modifying model weights.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.