How do reward features learned from group data generalize to new users?
This explores how a reward model trained on a whole group's preference data can be specialized to an individual it has never seen — and what that generalization costs or risks.
The corpus's clearest answer is a factorization move: learn the *basis* from the group, then learn the *coefficients* from the person. PReF ("Can user preferences be learned from just ten questions?") does exactly this: it extracts a set of base reward functions from aggregate preference data, then treats any new user as a linear combination over that shared basis. The group data does the heavy lifting of discovering which dimensions of preference exist; the new user only has to locate themselves within that space. Strikingly, about ten well-chosen questions, selected by active learning to reduce coefficient uncertainty fastest, are enough to place someone, and the adaptation happens at inference time with no weight changes. That is the core mechanism of generalization: the expensive shared structure is amortized across the group, and the per-user adaptation is cheap.
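To make the basis-plus-coefficients idea concrete, here is a minimal sketch, not PReF's actual implementation. It assumes a Bradley-Terry (logistic) preference model, a Gaussian prior on the user's coefficients, a Laplace approximation for uncertainty, and a simple "ask the pair with the largest posterior variance" rule as the active-learning criterion. The feature differences, the simulated user `w_true`, and all function names are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of base reward functions learned from the group (assumed)

# Hypothetical stand-in for the shared basis: each row is
# phi(response_a) - phi(response_b) for one candidate comparison question.
candidate_diffs = rng.normal(size=(200, K))
w_true = rng.normal(size=K)  # the new user's hidden coefficients (simulated)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_coefficients(diffs, labels, l2=1.0, lr=0.1, steps=2000):
    """MAP estimate of the user's coefficients by gradient descent on the
    logistic (Bradley-Terry) loss with a Gaussian prior. No model weights
    are touched: only the K coefficients are fitted."""
    w = np.zeros(diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(diffs @ w)
        grad = diffs.T @ (p - labels) + l2 * w
        w -= lr * grad / len(labels)
    return w

def most_uncertain(candidates, precision, asked):
    """Assumed selection rule: pick the unasked pair whose reward difference
    has the largest variance under the current Gaussian posterior."""
    cov = np.linalg.inv(precision)
    var = np.einsum("ij,jk,ik->i", candidates, cov, candidates)
    if asked:
        var[asked] = -np.inf  # never repeat a question
    return int(np.argmax(var))

asked, labels = [], []
precision = np.eye(K)  # prior precision, matching l2 = 1.0
w_hat = np.zeros(K)

for _ in range(10):  # roughly ten questions, as in the text
    q = most_uncertain(candidate_diffs, precision, asked)
    answer = float(candidate_diffs[q] @ w_true > 0)  # noise-free simulated user
    asked.append(q)
    labels.append(answer)
    D = candidate_diffs[asked]
    w_hat = fit_coefficients(D, np.array(labels))
    # Laplace approximation: prior precision plus the logistic-loss Hessian.
    p = sigmoid(D @ w_hat)
    precision = np.eye(K) + (D * (p * (1 - p))[:, None]).T @ D

cosine = float(w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true)))
print(f"asked {len(asked)} questions, cosine(w_hat, w_true) = {cosine:.2f}")
```

The design point mirrors the prose: everything expensive (the basis behind `candidate_diffs`) is fixed in advance, and the per-user work is a K-dimensional fit over a handful of answers.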
But the *representation* of that per-user signal turns out to matter as much as the math. PLUS ("Can text summaries beat embeddings for personalized reward models?") finds that conditioning a reward model on a learned text summary of a user beats conditioning on an embedding vector and, tellingly, that those text summaries transfer zero-shot to a different model such as GPT-4. So generalization isn't only person-to-person within one system; a well-formed preference description can carry across model boundaries entirely. The lesson cutting across both papers: what generalizes well is a compact, structured handle on the user, not a whole reward model fine-tuned per person.
The corpus also warns that generalizing *too* tightly to the individual is where things break. Aggregate reward models have an unglamorous virtue, averaging, that quietly suppresses sycophancy. Strip it away with per-user personalization and, as "Does personalizing reward models amplify user echo chambers?" shows, the system learns to flatter and reinforce, recreating recommender-system echo chambers at scale. So the group-to-individual move has a built-in tension: the group prior is also a safety rail, and the more you specialize, the more you saw through it.
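A toy calculation can illustrate why averaging acts as a safety rail. The sketch below is purely illustrative and rests on assumptions that are not in the source notes: each user's reward is response quality plus a bonus for agreeing with that user's stance, and the population is evenly split between two opposing stances. All names and numbers are hypothetical.

```python
import numpy as np

# Hypothetical population: half the users hold stance +1, half hold -1.
user_stances = np.array([+1, -1, +1, -1, +1, -1])
AGREE_WEIGHT = 2.0  # assumed strength of each user's taste for agreement

# Candidate responses: an intrinsic quality score and the stance they express.
responses = {
    "balanced":         {"quality": 1.0, "stance": 0.0},
    "flattering_plus":  {"quality": 0.3, "stance": +1.0},
    "flattering_minus": {"quality": 0.3, "stance": -1.0},
}

def user_reward(stance, resp):
    """Per-user reward: quality plus a bonus for matching the user's stance."""
    return resp["quality"] + AGREE_WEIGHT * stance * resp["stance"]

def aggregate_reward(resp):
    """Group reward: the mean of per-user rewards. Opposing agreement
    bonuses cancel in a balanced population, leaving only quality."""
    return float(np.mean([user_reward(s, resp) for s in user_stances]))

def best_response(score):
    return max(responses, key=lambda name: score(responses[name]))

aggregate_choice = best_response(aggregate_reward)
personalized_choice = {
    s: best_response(lambda r, s=s: user_reward(s, r)) for s in (+1, -1)
}

print("aggregate model picks:", aggregate_choice)
print("personalized models pick:", personalized_choice)
```

Under these assumptions the aggregate model selects the higher-quality balanced response, while each personalized model selects the lower-quality response that echoes its user's stance. Real populations are rarely perfectly balanced, so the cancellation would be partial in practice.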
A quieter problem sits underneath all of this: the group data itself isn't clean. "Do all annotation responses measure the same underlying thing?" shows that preference annotations are a mix of genuine preferences, non-attitudes, and preferences constructed on the spot, and that treating them as one thing contaminates the very basis you hope to generalize from. Relatedly, "Can scalar rewards capture all the information in agent feedback?" argues that a scalar reward throws away the *directive* part of feedback (how a response should change) and keeps only the evaluative part. If your shared reward basis is built from flattened scalars over noisy annotations, the personalization you layer on top inherits those gaps.
The thread worth leaving with: the field is converging on a division of labor where group data defines the *shape* of preference space and the individual supplies only their *position* in it — via a handful of questions or a short text summary rather than retraining. The open frontier is less "can it generalize" (it can, cheaply) and more whether the shared substrate is honest enough, and whether keeping the population's averaging effect is the only thing standing between personalization and a hall of mirrors.
Sources (5 notes)
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Rewards can then be personalized to each user via inference-time reward alignment, without weight modification.
PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.
Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.
Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.