Can aggregate reward models satisfy genuinely disagreeing users?

When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?

Synthesis note · 2026-05-18 · sourced from Recommenders Personalized

There is a clean argument for why aggregate reward models cannot serve disagreement-heavy tasks. Consider a subjective question where 51% of the target audience prefers answer A and 49% prefers answer B. With a single reward model trained on aggregated preferences, the deployment has exactly two options. Option one is to pick A as the preferred answer: 49% of users are unhappy 100% of the time. Option two is to sample A and B in proportion to their preference rates: 100% of users are unhappy approximately half the time (A-preferring users 49% of the time, B-preferring users 51% of the time). Both options are unsatisfactory.
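The arithmetic behind the two options can be checked with a short sketch. Everything here is illustrative: the function names are hypothetical, and "unhappy" is assumed to mean simply "was served the answer the user does not prefer".

```python
# Illustrative only: names are hypothetical, and "unhappy" is assumed to
# mean "received the non-preferred answer" for a single request.
P_A = 0.51  # share of users who prefer answer A (from the example above)


def unhappy_rates(prob_serve_a: float) -> dict:
    """Per-group probability of being served the non-preferred answer."""
    return {
        "prefers_A": 1.0 - prob_serve_a,  # unhappy whenever B is served
        "prefers_B": prob_serve_a,        # unhappy whenever A is served
    }


def population_unhappiness(prob_serve_a: float, share_a: float = P_A) -> float:
    """Population-weighted average of the per-group unhappiness rates."""
    rates = unhappy_rates(prob_serve_a)
    return share_a * rates["prefers_A"] + (1.0 - share_a) * rates["prefers_B"]


majority = unhappy_rates(1.0)      # option one: always serve A
proportional = unhappy_rates(P_A)  # option two: serve A with probability 0.51

print("majority:", majority, population_unhappiness(1.0))
print("proportional:", proportional, population_unhappiness(P_A))
```

Under these assumptions the population-level unhappiness is about 0.49 for the majority policy and about 0.4998 for proportional sampling. The two options differ mainly in who bears the cost, not in how much cost there is.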

The structural problem is that aggregate reward models compress preference distributions into single scalars (or single rankings) that cannot represent disagreement. They reward what the majority prefers and incidentally suppress what the minority prefers. For tasks with high consensus this is fine, because the majority preference is everyone's preference. For tasks with genuine disagreement (subjective evaluations, value-laden topics, creative judgment, cultural-context-dependent choices), aggregate models systematically exclude the minority view.
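One way to see the compression is to compare two populations that produce the same aggregate number. The sketch below is a toy illustration, not a reward-model implementation: both populations are made-up data, and `aggregate_preference` and `disagreement` are hypothetical helper names.

```python
from statistics import mean

# Made-up data: each entry is one user's probability of preferring A over B.
split_population = [1.0] * 51 + [0.0] * 49  # two stable, opposing camps
indifferent_population = [0.51] * 100       # everyone nearly indifferent


def aggregate_preference(users):
    """The single scalar an aggregate fit would see: mean rate of preferring A."""
    return mean(users)


def disagreement(users):
    """Share of users whose own preference sides against the aggregate winner."""
    winner_is_a = aggregate_preference(users) >= 0.5
    against = [(u < 0.5) if winner_is_a else (u >= 0.5) for u in users]
    return sum(against) / len(users)


for name, users in [("split", split_population),
                    ("indifferent", indifferent_population)]:
    print(name, aggregate_preference(users), disagreement(users))
```

Both populations aggregate to the same preference rate for A, yet in the split population 49% of users side against the aggregate winner while in the indifferent population none do. The scalar alone cannot tell the two cases apart.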

This is not a quality problem with current reward models. It is a representational problem with the aggregation step itself. Even a perfect aggregate reward model would face this dilemma. The fix has to operate at a different level: reward models that can be specialized to individual users (or to user groups whose preferences cluster) rather than averaged across the population.
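As a minimal sketch of the difference, the 51/49 example can be served with one aggregate reward or with one reward per group. The reward tables and function names below are hypothetical. The sketch also assumes that each user's group membership is known at serving time, which a real system would have to infer.

```python
# Hypothetical setup: two groups with opposite rewards, and group membership
# assumed known at serving time (a strong simplifying assumption).
GROUP_SHARES = {"prefers_A": 0.51, "prefers_B": 0.49}
GROUP_REWARDS = {
    "prefers_A": {"A": 1.0, "B": 0.0},
    "prefers_B": {"A": 0.0, "B": 1.0},
}
ANSWERS = ("A", "B")


def aggregate_reward(answer):
    """Population-weighted average reward: the aggregate model's view."""
    return sum(GROUP_SHARES[g] * GROUP_REWARDS[g][answer] for g in GROUP_SHARES)


def aggregate_policy(group):
    """Serves the aggregate-reward maximizer to everyone, ignoring the group."""
    return max(ANSWERS, key=aggregate_reward)


def group_policy(group):
    """Serves the answer that maximizes the user's own group reward."""
    return max(ANSWERS, key=lambda a: GROUP_REWARDS[group][a])


def served_share(policy):
    """Share of users who receive their own top-ranked answer under a policy."""
    total = 0.0
    for group, share in GROUP_SHARES.items():
        top = max(ANSWERS, key=lambda a: GROUP_REWARDS[group][a])
        if policy(group) == top:
            total += share
    return total


print("aggregate:", served_share(aggregate_policy))
print("per-group:", served_share(group_policy))
```

In this toy setup the aggregate policy serves A to everyone and reaches only the 51% majority, while the per-group policy reaches every user. The gain depends entirely on identifying the group correctly.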

The implication extends beyond personalization. Whenever a system is deployed against a heterogeneous user base with genuinely divergent preferences, the standard "train one model to satisfy everyone" architecture cannot satisfy everyone: it either never serves a minority or serves no one consistently. The right architecture either splits per-user (personalization) or splits per-cluster (group-level adaptation). Aggregate reward modeling is appropriate only when the underlying preferences are actually unimodal, and that is a stronger assumption than RLHF deployments typically test.

Inquiring lines that read this note 34

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Does RLHF training sacrifice accuracy and grounding for user agreement?

Can model confidence signals reliably improve reasoning quality and calibration?

Does user preference for confirmation override model capability for disagreement?

How can we distinguish genuine user preferences from measurement artifacts?

How does test-time aggregation affect reasoning correctness and reliability?

How do language models inherit human biases from training data?

How should human oversight be integrated with autonomous AI systems?

How do guardrails vary their refusal rates based on user demographics?

How can AI alignment serve diverse human preferences at scale?

What dimensions of recommendation quality do standard metrics miss?

Do high-disagreement items signal contested values or measurement noise?

How do aggregate reward models systematically exclude minority user preferences?

Can alternative training methods improve on supervised fine-tuning for language models?

What constrains reinforcement learning's ability to expand model reasoning?

When should tasks involve human-AI partnership versus full automation?

Can worker preference serve as a legitimate axis for delegation design?

What properties determine whether reward signals teach genuine reasoning?

What causes reward models to favor length and sycophancy?

Related concepts in this collection 3

Does preference data need more raters than examples? Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?
same paper, the theoretical foundation
Does personalizing reward models amplify user echo chambers? Personalized reward models solve the minority-preference problem but may introduce new risks by reinforcing existing user beliefs and narrowing exposure to diverse viewpoints.
same paper, the tension with personalization
Can user preferences be learned from just ten questions? Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
adjacent: the technical solution to the aggregation problem
