Does preference data need more raters than examples?
Pairwise preference data violates the i.i.d. assumption because preferences vary across raters. Does this mean PAC bounds for reward models depend on rater diversity rather than just sample size?
Standard PAC learning theory assumes training data is independently and identically distributed. Reward models trained on aggregated human preferences quietly violate this assumption: examples come from raters whose preferences differ systematically, so the data is not i.i.d. across raters even if it appears so within each rater. The paper Capturing Individual Human Preferences with Reward Features derives the resulting PAC bound and shows that it has a different shape from the standard one: approximation error depends on the number of raters who provided feedback, not just on the number of examples.
This is the theoretical foundation that empirical reward-factorization work like PReF lacked. PReF showed that 10-20 active-learning queries suffice for per-user personalization given a base set of reward features, but its account of why was operational rather than formal. The PAC bound provides the formal account: when each user's reward is a linear combination of reward features learned from group data, the generalization error for a new user decomposes into a term that depends on examples per rater and a separate term that depends on how many raters contributed to feature learning. Both terms matter; both can be optimized.
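The per-user adaptation step can be sketched as follows. This is a minimal illustration, not PReF's or the paper's implementation: it assumes the reward features are already learned and frozen, models a new user's reward as a weight vector over those features, and fits the weights with a Bradley-Terry (logistic) likelihood on 20 pairwise answers. The function name fit_user_weights, the three-dimensional features, and the random (rather than actively selected) queries are all assumptions made for the example.

```python
import numpy as np

def fit_user_weights(feature_diffs, prefers_first, steps=500, lr=0.5):
    """Fit w so that P(first preferred) = sigmoid(w . (phi_first - phi_second)).

    feature_diffs: (n_queries, n_features) differences of frozen reward features.
    prefers_first: (n_queries,) array of 1.0 if the user chose the first response.
    """
    w = np.zeros(feature_diffs.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-feature_diffs @ w))
        # Gradient of the mean Bradley-Terry log-likelihood with respect to w.
        grad = feature_diffs.T @ (prefers_first - probs) / len(prefers_first)
        w += lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.5, -2.0, 0.5])          # hypothetical user, unknown to the learner
diffs = rng.normal(size=(20, 3))             # 20 random queries (not actively selected)
labels = (diffs @ true_w > 0).astype(float)  # noiseless answers, a simplification

w_hat = fit_user_weights(diffs, labels)
train_accuracy = float(np.mean((diffs @ w_hat > 0) == (labels == 1.0)))
```

Only the weight vector is fitted here, which is why a handful of queries can be enough once the features exist; the quality of those features is what the rater-count term in the bound governs.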
The methodological consequence is sharp. Standard practice in RLHF data collection optimizes for example count: more pairwise preferences per rater, and more raters annotating the same examples for inter-rater reliability. The PAC bound argues for a different allocation: when preferences disagree (high-disagreement tasks like creative writing, subjective evaluation, value-laden topics), more raters with fewer examples each beats fewer raters with many examples each at a comparable annotation budget. The features needed to span the preference space require diversity along the rater axis, not just depth along the example axis.
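One way to see why the rater axis matters is a linear-algebra toy, which is not the paper's bound. Assume, hypothetically, that each rater's reward is a weight vector applied to a shared latent feature vector of size latent_dim, and that rewards are observed without noise rather than through pairwise comparisons. The rater-by-item reward matrix then has rank at most the number of raters, so the number of feature directions the data can expose is capped by rater count no matter how many items each rater labels.

```python
import numpy as np

def recoverable_feature_dim(num_raters, num_items, latent_dim=5, seed=0):
    """Rank of the rater-by-item reward matrix under a linear reward model.

    All names and the noiseless linear setup are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    item_features = rng.normal(size=(num_items, latent_dim))   # shared latent features
    rater_weights = rng.normal(size=(num_raters, latent_dim))  # one preference vector per rater
    rewards = rater_weights @ item_features.T                  # shape (num_raters, num_items)
    return int(np.linalg.matrix_rank(rewards))

# Concentrated pool: 2 raters x 1000 items = 2000 ratings.
few_raters_deep = recoverable_feature_dim(num_raters=2, num_items=1000)

# Broad pool: 20 raters x 50 items = 1000 ratings.
many_raters_shallow = recoverable_feature_dim(num_raters=20, num_items=50)
```

Under these assumptions the concentrated pool exposes only 2 of the 5 latent directions despite using twice as many ratings, while the broad pool exposes all 5.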
For builders, this changes how reward-model data collection should be structured for personalization. Generic single-distribution reward models can be trained with concentrated rater pools. Adaptive reward models need broad rater pools and structured feature-learning even at lower per-rater example counts.
Inquiring lines that use this note as a source 11
This note is a source for these synthesized inquiries.
- How do self-generated preference pairs from a strong teacher compare to human feedback?
- How can consistency across measurement conditions identify genuine versus constructed preferences?
- Why do online ratings fail to represent independent individual preferences?
- What consistency tests could distinguish constructed from genuine preferences?
- Why do strong-opinion raters dominate public rating distributions?
- What makes minority preferences disappear in aggregated single-distribution reward models?
- Why does preference measurement validity matter more than aggregation methods?
- How do binary comparisons constrain reward scale in multi-user preference learning?
- Why does single-reward RLHF fail to represent diverse human preferences?
- How much does preference data freshness matter compared to data source in DPO?
- Why does preference measurement validity matter before any aggregation?
Related concepts in this collection 3
- Can user preferences be learned from just ten questions?
  Explores whether adaptive question selection can efficiently infer user-specific reward coefficients without historical data or fine-tuning. This matters for scaling personalization without per-user model updates.
  same conceptual framework: this note provides the theoretical PAC foundation that PReF's empirical efficiency demonstrates
- Can aggregate reward models satisfy genuinely disagreeing users?
  When users have conflicting preferences, do aggregate reward models face an impossible choice between satisfying majorities or sampling proportionally? What does this reveal about RLHF deployment?
  same paper, the consequence of treating preferences as i.i.d.
- Can text summaries beat embeddings for personalized reward models?
  When training reward models on diverse user preferences, does conditioning on learned text-based summaries of user preferences outperform embedding vectors? This matters because better representations could make personalization more interpretable and portable.
  adjacent: a different mechanism for personalized alignment
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Capturing Individual Human Preferences with Reward Features
- Measuring Human Preferences in RLHF is a Social Science Problem
- Information-Theoretic Reward Decomposition for Generalizable RLHF
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
- Personalized Language Modeling from Personalized Human Feedback
- Checklists Are Better Than Reward Models For Aligning Language Models
- Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
- RLHF Workflow: From Reward Modeling to Online RLHF
Original note title
PAC bound for personalized reward models depends on number of raters not just number of examples — preference data is not iid so traditional sample-complexity bounds undercount the relevant axis