Does preference data need more raters than examples?

Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?

Synthesis note · 2026-05-18 · sourced from Recommenders Personalized

Standard PAC learning theory assumes training data is independently and identically distributed. Reward models trained on aggregated human preferences quietly violate this assumption: examples come from raters whose preferences differ systematically, so the pooled data is not i.i.d. across raters even if it appears so within each rater. The paper "Capturing Individual Human Preferences with Reward Features" derives the resulting PAC bound and shows that it has a different shape from the standard one: approximation error depends on the number of raters who provided feedback, not just on the number of examples.

This is the theoretical foundation that empirical reward-factorization work like PReF lacked. PReF showed that 10-20 active-learning queries suffice for per-user personalization given a base set of reward features, but its account of why was operational rather than formal. The PAC bound provides the formal account: when each user's reward is a linear combination of reward features learned from group data, the generalization error to a new user decomposes into a term that depends on examples per rater and a separate term that depends on how many raters contributed to feature learning. Both terms matter; both can be optimized.
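The per-user step described above can be illustrated with a minimal sketch. It assumes a Bradley-Terry preference model over a linear reward `w @ phi(x)`, with the shared features already learned and held fixed; the random feature differences, the dimension `d = 4`, the noise-free labels, and the names `preference_prob` and `fit_user_weights` are illustrative assumptions, not the method of PReF or of the PAC paper.

```python
import numpy as np

def preference_prob(w, phi_a, phi_b):
    """Bradley-Terry probability that a user with weights w prefers a over b."""
    return 1.0 / (1.0 + np.exp(-(w @ (phi_a - phi_b))))

def fit_user_weights(diffs, labels, steps=500, lr=0.5, l2=1e-2):
    """Fit one user's weights by gradient descent on L2-regularized logistic loss.

    diffs:  (n, d) feature differences phi(a) - phi(b), features held fixed
    labels: (n,) 1.0 where the user preferred a, else 0.0
    """
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(diffs @ w)))
        grad = diffs.T @ (p - labels) / len(labels) + l2 * w
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
d = 4                                   # hypothetical number of shared reward features
w_true = rng.normal(size=d)             # the new user's unknown weights (synthetic)
diffs = rng.normal(size=(20, d))        # 20 comparisons, stand-ins for learned features
labels = (diffs @ w_true > 0).astype(float)   # noise-free answers, an assumption

w_hat = fit_user_weights(diffs, labels)
train_acc = np.mean((diffs @ w_hat > 0) == (labels == 1.0))
```

Only `d` numbers are estimated per user here, which is one way to read why a small number of queries can be enough once the features exist. The quality of those features is a separate question, and it is the one the rater-count term of the bound addresses.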

The methodological consequence is sharp. Standard practice in RLHF data collection optimizes for example count: more pairwise preferences per rater, and more raters annotating the same examples for inter-rater reliability. The PAC bound argues for a different allocation. On high-disagreement tasks (creative writing, subjective evaluation, value-laden topics), more raters with fewer examples each beats fewer raters with many examples each. The features needed to span the preference space require diversity along the rater axis, not just depth along the example axis.
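A small linear-algebra sketch can make the "span" intuition concrete. It assumes each rater is summarized by a weight vector in a hypothetical 4-dimensional preference space and that every rater's vector is recovered perfectly, so examples per rater play no role. This illustrates the argument only; it is not the bound from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hypothetical dimension of the latent preference space

def spanned_dims(n_raters):
    """Rank of the stacked rater weight vectors.

    This counts how many independent preference directions the rater pool
    covers, assuming each rater's vector is known exactly.
    """
    rater_weights = rng.normal(size=(n_raters, d))
    return int(np.linalg.matrix_rank(rater_weights))

# Same hypothetical budget of 160 comparisons, split two ways:
few_raters = spanned_dims(2)    # 2 raters x 80 comparisons each
many_raters = spanned_dims(8)   # 8 raters x 20 comparisons each
```

With two raters the pool covers at most two directions, however many comparisons each rater supplies; eight raters can cover all four. Extra depth per rater sharpens the estimate of each vector but cannot add directions that no rater in the pool holds.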

For builders, this changes how reward-model data collection should be structured for personalization. Generic single-distribution reward models can be trained with concentrated rater pools. Adaptive reward models need broad rater pools and structured feature-learning even at lower per-rater example counts.


Related concepts in this collection 3


Can user preferences be learned from just ten questions?
Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
same conceptual framework: this note supplies the theoretical PAC foundation for the empirical efficiency that PReF demonstrates

Can aggregate reward models satisfy genuinely disagreeing users?
When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
same paper, the consequence of treating preferences as i.i.d.

Can text summaries beat embeddings for personalized reward models?
When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
adjacent: a different mechanism for personalized alignment
