Can reward models be personalized if annotators lack stable preferences?
This explores a tension: personalizing reward models assumes each user has a stable preference to fit, but if annotators are inconsistent — answering differently on different days — what exactly are we personalizing to?
The question is whether per-user reward modeling survives the discovery that many "preferences" aren't stable signals at all. The most direct answer in the corpus comes from work decomposing annotation responses into three kinds of signal: genuine preferences, non-attitudes (essentially noise, where the annotator had no real opinion), and constructed preferences (opinions invented on the spot in response to how the question was framed) [Do all annotation responses measure the same underlying thing?]. The three are separable by whether they stay consistent across measurement conditions. The implication for personalization is sharp: if you personalize without first sorting these apart, you're fitting a model to each user's noise and momentary framing artifacts, not to anything durable about them.
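One way to make the consistency criterion concrete is the sketch below. It is only an illustration under assumptions that are not in the source: each item is asked several times under each of a few measurement conditions (for example, two framings), and the 0.8 agreement threshold, the function names, and the condition labels are all hypothetical. A response pattern that is unstable even within a condition is treated as a non-attitude, one that is stable within each condition but flips between conditions as constructed, and one that is stable everywhere as genuine.

```python
from collections import Counter

def modal_share(labels):
    """Return (most common label, fraction of labels that match it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def classify_item(responses_by_condition, threshold=0.8):
    """Classify one annotator's repeated answers to one item.

    responses_by_condition maps a condition name (hypothetical, e.g. a framing)
    to the list of labels the annotator gave under that condition.
    """
    modes, shares = [], []
    for labels in responses_by_condition.values():
        mode, share = modal_share(labels)
        modes.append(mode)
        shares.append(share)
    if min(shares) < threshold:
        return "non-attitude"   # unstable even when the framing is held fixed
    if len(set(modes)) > 1:
        return "constructed"    # stable per framing, but the framing decides the answer
    return "genuine"            # same answer across repeats and framings

genuine = classify_item({"neutral": ["A"] * 4, "reworded": ["A"] * 4})
constructed = classify_item({"neutral": ["A"] * 4, "reworded": ["B"] * 4})
noise = classify_item({"neutral": ["A", "B", "A", "B"], "reworded": ["B", "A", "A", "B"]})
```

Only items classified as genuine would then be passed on to a per-user reward model. A real pipeline would need a statistically justified threshold rather than a fixed cutoff.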
So the honest answer is conditional: personalization works only to the extent there's a stable signal to recover. Several methods quietly depend on this. Reward factorization (PReF) infers a user's coefficients from as few as ten adaptive questions [Can user preferences be learned from just ten questions?], and joint summary-and-reward training (PLUS) learns text summaries of a user's preferences that transfer to GPT-4 for zero-shot personalization [Can text summaries beat embeddings for personalized reward models?]. Both implicitly assume the answers they elicit reflect a consistent latent preference. If a user is supplying non-attitudes, ten questions will reduce coefficient *uncertainty* without ever locating a real coefficient, and the model becomes confidently personalized to nothing.
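The "confidently personalized to nothing" failure can be seen in a deliberately simplified setting. The sketch below is not PReF's model: it assumes a single scalar coefficient, a linear-Gaussian response model with known noise variance, and a fixed (non-adaptive) set of ten questions, and every name in it is hypothetical. Under those assumptions the posterior variance depends only on which questions were asked, so it shrinks by exactly the same amount whether the answers carry a preference or are pure noise.

```python
import random

def gaussian_posterior(xs, ys, prior_var=1.0, noise_var=1.0):
    """Posterior mean and variance of w in y = w * x + noise, with prior w ~ N(0, prior_var)."""
    precision = 1.0 / prior_var + sum(x * x for x in xs) / noise_var
    mean = (sum(x * y for x, y in zip(xs, ys)) / noise_var) / precision
    return mean, 1.0 / precision

rng = random.Random(0)
# Ten question features (hypothetical stand-ins for elicitation queries).
questions = [rng.uniform(-1.0, 1.0) for _ in range(10)]

# A user with a stable coefficient of 0.8 versus a user answering at random.
stable_answers = [0.8 * x + rng.gauss(0.0, 0.1) for x in questions]
non_attitude_answers = [rng.gauss(0.0, 1.0) for _ in questions]

mean_stable, var_stable = gaussian_posterior(questions, stable_answers)
mean_noise, var_noise = gaussian_posterior(questions, non_attitude_answers)

# The variance is computed from the questions alone, so both users look equally "certain".
print(var_stable == var_noise, var_stable < 1.0)
```

In models with a logistic likelihood or adaptive question selection the variance does depend on the answers to some degree, but the broader point of the sketch carries over: a shrinking uncertainty estimate is not evidence that a stable preference exists.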
There's a useful reframing hidden in the abstraction-versus-recall finding: semantic memory (distilled preference summaries) consistently beats episodic memory (replaying specific past interactions) for personalization [Does abstract preference knowledge outperform specific interaction recall?]. That abstraction step is partly what protects against instability: summarizing across many interactions averages out the one-off, constructed responses and surfaces what actually recurs. In other words, the cure for unstable annotations may be to personalize from patterns over time rather than from individual labels.
The deeper reason personalization is attractive in the first place also bounds how far it should go. Aggregate reward models can't represent genuine disagreement at all: a 51/49 split forces a single model either to leave 49% of users unhappy all the time or to leave everyone unhappy half the time [Can aggregate reward models satisfy genuinely disagreeing users?]. Standard BTL models, fit by maximum likelihood, average conflicting groups into a centroid policy that optimizes no one's utility, which latent-user methods like VPL try to fix by conditioning on who's asking [Do unimodal reward models actually serve all user preferences?]. But the same averaging that erases minority groups is also what suppresses an individual's noise. Strip it away and you inherit the recommender-system failure mode: per-user reward models can learn sycophancy and reinforce echo chambers precisely because nothing is pulling them back toward a shared baseline [Does personalizing reward models amplify user echo chambers?].
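The averaging effect can be worked through for a single comparison. This is a minimal sketch, assuming a Bradley-Terry model over one pair of responses A and B, where the maximum-likelihood reward gap is the log-odds of the observed win rate. The vote counts mirror the 51/49 example, and the smoothing constant is an assumption added only so that unanimous groups yield finite values.

```python
import math

def bt_reward_gap(wins_a, wins_b, smoothing=0.5):
    """Smoothed estimate of r(A) - r(B) under Bradley-Terry for one pair.

    Without smoothing this is the maximum-likelihood gap log(wins_a / wins_b);
    the smoothing term is an assumption that keeps unanimous groups finite.
    """
    return math.log((wins_a + smoothing) / (wins_b + smoothing))

# Pooled data: 51 annotators prefer A, 49 prefer B.
pooled = bt_reward_gap(51, 49)

# The same data split by (hypothetical) user group, each internally unanimous.
group_1 = bt_reward_gap(51, 0)   # strongly prefers A
group_2 = bt_reward_gap(0, 49)   # strongly prefers B

print(f"pooled gap: {pooled:.3f}, group 1: {group_1:.2f}, group 2: {group_2:.2f}")
```

The pooled gap sits close to zero, which describes near-indifference, while each group on its own shows a large gap in opposite directions. The single pooled reward therefore matches neither group, and that mismatch is what user-conditional methods aim to remove.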
The thing worth walking away with: instability isn't only a data-quality nuisance to be cleaned — it's the boundary line between personalization and pandering. The genuine-preference signal is what *should* be personalized; the constructed and non-attitude signals are exactly the volatility that, when amplified per-user, becomes sycophancy. Telling them apart, not collecting more labels, is the real prerequisite.
Sources (7 notes)
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. The resulting rewards can be personalized to a user via inference-time reward alignment, without modifying model weights.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
The PRIME framework shows that semantic memory (preference summaries, parametric encodings) consistently beats episodic memory (retrieved past interactions) across models. Recency-based recall outperforms similarity-based retrieval, and task fine-tuning outperforms preference-tuning methods.
Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.
Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.