What makes minority preferences disappear in aggregated single-distribution reward models?
This explores the mechanism — not just the fact — by which a single reward model trained on pooled human preferences erases the people who disagree with the majority.
A single reward model trained on pooled preferences ends up serving no one in particular, and minority views vanish in the math rather than in the data quality. The corpus is unusually clear here: it's a representational failure, not a labeling mistake. The core argument is geometric. A standard Bradley-Terry-Luce (BTL) reward model assumes there is one underlying utility function that everyone shares, so when it fits the data it lands on a centroid, the average of conflicting preferences, which optimizes nobody's actual utility (see "Do unimodal reward models actually serve all user preferences?"). If two groups genuinely want opposite things, the average is a point neither group asked for.
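The centroid effect can be made concrete with a short derivation. This is a sketch under assumptions that are not in the source notes: a single comparison between responses a and b, two groups whose utility gaps are exactly opposite (+Δ and −Δ), a share p of labels coming from the first group, and a reward model flexible enough to fit this one pair exactly in the large-data limit.

```latex
% Assumed setup (illustrative, not from the source notes):
% one comparison (a vs. b), two groups with utility gaps +\Delta and -\Delta,
% and a fraction p of the pooled labels coming from the first group.
\begin{align*}
P_{\text{BTL}}(a \succ b) &= \sigma\big(r(a) - r(b)\big),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}} \\
P_{\text{pooled}}(a \succ b) &= p\,\sigma(\Delta) + (1 - p)\,\sigma(-\Delta) \\
\hat{r}(a) - \hat{r}(b) &= \operatorname{logit}\big(P_{\text{pooled}}(a \succ b)\big)
\end{align*}
% Since \sigma(\Delta) + \sigma(-\Delta) = 1, setting p = 1/2 gives
% P_pooled = 1/2 and a fitted gap of 0: the single model is indifferent
% even though both groups hold strong, opposite preferences.
```

Under these assumptions the fitted gap depends only on the pooled label frequency, so the model cannot distinguish two strongly opposed groups from one group that barely cares.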
What makes this stark is that you can't fix it by collecting better data. With a 51-49 preference split, a single model has only two options: always side with the 51% and leave the 49% permanently unhappy, or split the difference and leave everyone unhappy half the time (see "Can aggregate reward models satisfy genuinely disagreeing users?"). There is no single scalar reward that represents disagreement, because disagreement isn't noise to be averaged out; it's structure the model has no slot for. The minority doesn't disappear because it's small; it disappears because the model's shape can only point in one direction at a time.
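The arithmetic behind the 51-49 dilemma can be sketched in a few lines of Python. Everything here is illustrative: the helper names `fit_pooled_gap` and `satisfaction` are hypothetical, the setup is a single pair of responses (A and B), and each rater is assumed to be satisfied only when served the response they prefer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_pooled_gap(share_a, steps=2000, lr=0.5):
    """Maximize the expected BTL log-likelihood for one pair (A vs. B).

    share_a is the fraction of pooled labels that prefer A.
    Returns the fitted reward gap r(A) - r(B).
    """
    gap = 0.0
    for _ in range(steps):
        # Gradient of s*log(sigmoid(gap)) + (1 - s)*log(sigmoid(-gap))
        # with respect to gap is s - sigmoid(gap).
        gap += lr * (share_a - sigmoid(gap))
    return gap

def satisfaction(prob_serve_a):
    """Per-group chance of receiving the preferred response."""
    return {
        "majority (wants A)": prob_serve_a,
        "minority (wants B)": 1.0 - prob_serve_a,
    }

pooled_gap = fit_pooled_gap(0.51)
always_majority = satisfaction(1.0)   # greedy policy: always serve A
split_difference = satisfaction(0.5)  # coin flip between A and B

print(f"fitted reward gap: {pooled_gap:.4f}")
print("always side with the 51%:", always_majority)
print("split the difference:", split_difference)
```

The fitted gap comes out at roughly 0.04, barely above zero, yet a policy that maximizes this reward serves A every time. Neither policy gives both groups what they want, which is the sense in which the scalar reward has no slot for disagreement.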
A second, quieter cause is statistical. Preference data isn't independent and identically distributed the way most learning theory assumes: it comes from raters who genuinely differ, so the same response can be 'good' to one rater and 'bad' to another (see "Does preference data need more raters than examples?"). When you pool these, the model can't tell whether a contradictory label comes from a different person or from measurement error, and it treats systematic minority signal as something to regress away. This compounds with a subtler problem: annotations themselves aren't one thing. Some encode genuine stable preferences, but others are non-attitudes or preferences constructed on the spot, and treating all three kinds uniformly contaminates the very signal a reward model is built on (see "Do all annotation responses measure the same underlying thing?"). Minority preferences are especially vulnerable to being swallowed by this confusion.
The interesting part is what the corpus offers as an escape, and the trap waiting there. The fix is conditioning the reward on who is being served: VPL recovers the full multi-modal preference distribution using latent user context (see "Do unimodal reward models actually serve all user preferences?"), and reward factorization methods like PReF can pin down an individual's preference coefficients from as few as ten well-chosen questions, without retraining weights (see "Can user preferences be learned from just ten questions?"). But personalization has a sharp edge. Remove the averaging effect and you also remove the thing that was quietly restraining sycophancy: a per-user reward model can learn to flatter and reinforce each user's existing views, recreating recommender-system echo chambers at the level of the model's values (see "Does personalizing reward models amplify user echo chambers?").
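The reward-factorization idea can be illustrated with a minimal sketch. It is not the PReF implementation: it assumes a user's reward is a weighted sum of three fixed base reward features, simulates a noiseless user with hypothetical coefficients `true_lam`, draws the ten questions at random rather than selecting them actively, and fits the coefficients by regularized logistic regression. The function `fit_user_coefficients` and all parameter values are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fit_user_coefficients(questions, answers, l2=0.1, lr=0.05, steps=500):
    """MAP estimate of one user's coefficients under a BTL likelihood.

    questions: feature-difference vectors phi(a) - phi(b), one per question
    answers:   +1 if the user chose a, -1 if the user chose b
    Only the coefficient vector is fitted; base features stay fixed.
    """
    lam = [0.0] * len(questions[0])
    for _ in range(steps):
        grad = [-l2 * w for w in lam]  # Gaussian prior pulls toward zero
        for d, y in zip(questions, answers):
            # Gradient of log(sigmoid(y * lam.d)) is (1 - sigmoid(y * lam.d)) * y * d.
            weight = 1.0 - sigmoid(y * dot(lam, d))
            for k in range(len(lam)):
                grad[k] += weight * y * d[k]
        lam = [w + lr * g for w, g in zip(lam, grad)]
    return lam

random.seed(0)
true_lam = [1.0, -0.5, 0.25]  # hypothetical user, hidden from the learner
questions = [[random.gauss(0.0, 1.0) for _ in true_lam] for _ in range(10)]
answers = [1 if dot(true_lam, d) > 0 else -1 for d in questions]

est_lam = fit_user_coefficients(questions, answers)
cosine = dot(est_lam, true_lam) / (
    math.sqrt(dot(est_lam, est_lam)) * math.sqrt(dot(true_lam, true_lam))
)
print("estimated coefficients:", [round(w, 2) for w in est_lam])
print("cosine similarity to the simulated user:", round(cosine, 2))
```

Only the direction of the coefficients is recoverable from pairwise choices, which is why the comparison uses cosine similarity. The same property that makes this cheap is what creates the sharp edge: ten answers are enough to start steering the reward toward whatever a given user already prefers.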
So the thing worth walking away with: the averaging that erases minorities is the same averaging that protects against sycophantic capture. The aggregate model and the personalized model fail in mirror-image ways, and the open question the corpus circles is whether you can represent genuine disagreement without simply telling each person what they already believe.
Sources (6 notes)
Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Preference data is not i.i.d. across raters with different preferences. PAC bounds for personalized reward models decompose into terms depending on both examples per rater and number of raters, showing rater diversity matters as much as data volume.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.