INQUIRING LINE

What preference dimensions do base reward functions typically capture?

This explores what a 'base reward function' actually encodes — the underlying dimensions of preference that reward models learn to score, and how reductive that single signal turns out to be.


The cleanest answer in the corpus comes from work that treats preference as factorizable: rather than one monolithic score, you learn a small set of *base* reward functions, each capturing a distinct dimension of what people value, and then represent any individual user as a linear combination of those bases (see "Can user preferences be learned from just ten questions?"). The striking finding there is how little data you need: roughly ten well-chosen questions can pin down a user's coefficients, which suggests the underlying preference space is low-dimensional and shared, even if each person sits at a different point in it.
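A minimal sketch of the factorization idea, under loud assumptions: the three base dimensions, the synthetic user, and the randomly drawn questions are all hypothetical, and the coefficient fit is plain regularized logistic regression on pairwise answers rather than the active-learning question selection the source note describes. It only illustrates how a per-user reward can be a weighted sum of shared base scores, with the weights estimated from about ten comparisons.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def user_reward(base_scores, coeffs):
    """Per-user reward: a linear combination of shared base reward scores."""
    return dot(coeffs, base_scores)

def fit_coeffs(diffs, choices, steps=500, lr=0.5, l2=1e-2):
    """Estimate user coefficients from pairwise answers.

    diffs[i]   = base_scores(option A) - base_scores(option B) for question i
    choices[i] = 1.0 if the user picked A, else 0.0
    Assumes a Bradley-Terry style choice model; fitted by gradient ascent
    on the L2-regularized log-likelihood.
    """
    dim, n = len(diffs[0]), len(diffs)
    w = [0.0] * dim
    for _ in range(steps):
        grad = [-l2 * wj for wj in w]
        for d, c in zip(diffs, choices):
            p = 1.0 / (1.0 + math.exp(-dot(w, d)))
            for j in range(dim):
                grad[j] += (c - p) * d[j] / n
        w = [wj + lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical user with three base dimensions and ten random questions.
rng = random.Random(0)
true_coeffs = [1.0, -0.5, 2.0]
diffs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(10)]
choices = [1.0 if dot(d, true_coeffs) > 0 else 0.0 for d in diffs]
est_coeffs = fit_coeffs(diffs, choices)
```

The estimate recovers a direction in coefficient space rather than exact magnitudes, since noiseless pairwise choices only constrain which way the user leans on each comparison.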

But the more interesting story is what standard reward functions *fail* to capture. The conventional Bradley-Terry-Luce (BTL) reward model assumes a single utility function for everyone, so when real preferences are genuinely multi-modal across groups, maximum-likelihood fitting collapses them into a centroid that optimizes nobody (see "Do unimodal reward models actually serve all user preferences?"). The same structural blind spot shows up as a representation problem: a 51-49 split among disagreeing users can't be expressed by one scalar at all, forcing the model to either disappoint the minority always or everyone half the time (see "Can aggregate reward models satisfy genuinely disagreeing users?"). So the dimension a single, unfactorized reward function 'typically' captures is, by construction, the *average*, and averaging is exactly where minority and conflicting preferences disappear.
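The collapse can be seen in a small worked example. This is a sketch under stated assumptions: two hypothetical groups hold opposite preferences of equal strength (a reward gap of +2 versus -2, numbers chosen only for illustration), and a single Bradley-Terry model is fit to their pooled answers. For one pair of options, the maximum-likelihood reward gap is simply the logit of the pooled preference rate.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def btl_mle_gap(pref_rate):
    """MLE of r(A) - r(B) for one pair under a single Bradley-Terry model:
    the logit of the observed fraction preferring A."""
    return math.log(pref_rate / (1.0 - pref_rate))

# Hypothetical population: 51% hold a gap of +2 for A, 49% hold a gap of -2.
group_share = 0.51
true_gap = 2.0

# Pooled probability that a randomly drawn annotator picks A.
pooled_rate = group_share * sigmoid(true_gap) + (1 - group_share) * sigmoid(-true_gap)

# What a single reward model fit to the pooled data would learn.
pooled_gap = btl_mle_gap(pooled_rate)

# The same computation applied directly to a raw 51-49 vote.
split_gap = btl_mle_gap(0.51)
```

Both fitted gaps come out close to zero, so the single model reports near-indifference even though, in this toy setup, every annotator holds a strong preference one way or the other.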

There's a second axis the scalar misses entirely. Human feedback carries two orthogonal kinds of information, evaluative ('how good was this') and directive ('how should it change'), and a reward number captures only the first, discarding the directional content (see "Can scalar rewards capture all the information in agent feedback?"). Compounding this, the annotations the reward function is fit to aren't all measuring the same thing: behavioral-science analysis finds genuine preferences, non-attitudes, and on-the-spot constructed preferences mixed together, distinguishable only by how stable they are across conditions (see "Do all annotation responses measure the same underlying thing?"). A base reward function trained as if all three were the same signal is learning a blurred composite, not a clean preference dimension.
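The evaluative/directive distinction can be made concrete with a toy data structure. The `Feedback` class and `to_scalar_reward` function below are hypothetical names introduced only for illustration; the point is that two pieces of feedback asking for different changes become indistinguishable once reduced to a number.

```python
from dataclasses import dataclass

@dataclass
class Feedback:
    score: float     # evaluative channel: how good was this
    directive: str   # directive channel: how should it change

def to_scalar_reward(fb: Feedback) -> float:
    """Reduce feedback to a scalar reward; the directive text is dropped."""
    return fb.score

# Two hypothetical annotations with the same rating but different requests.
fb_a = Feedback(score=0.3, directive="shorten the introduction")
fb_b = Feedback(score=0.3, directive="add a worked example")

reward_a = to_scalar_reward(fb_a)
reward_b = to_scalar_reward(fb_b)
```

After the reduction, `reward_a` and `reward_b` are equal, so anything trained only on the scalar cannot tell which revision was requested.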

The responses to this push in two directions worth knowing about. One is to make the function conditional rather than universal: learned *text* summaries of a user's preferences turn out to condition reward models better than embedding vectors, and they capture dimensions zero-shot summaries miss while staying human-readable (see "Can text summaries beat embeddings for personalized reward models?"). The other is to abandon absolute preference scoring altogether: reframe the reward model as a *policy discriminator* that scores how close a behavior is to a target policy, which sidesteps the question of fixed preference labels entirely (see "Can reward models learn by comparing policies instead of judging them?").

The quiet warning underneath all of this: the moment you stop averaging and let reward functions capture each user's true dimensions, you also let them learn sycophancy and reinforce echo chambers, the same failure mode recommender systems fell into (see "Does personalizing reward models amplify user echo chambers?"). So 'what dimensions a base reward function captures' isn't just a technical question; the averaging that makes it lossy is also, partly, what keeps it from amplifying our worst preferences back at us.


Sources (8 notes)

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Outputs can then be personalized per user via inference-time reward alignment, without modifying model weights.

Do unimodal reward models actually serve all user preferences?

Standard BTL reward models assume a single utility function, but when preferences are genuinely multi-modal across user groups, maximum-likelihood fitting produces a centroid policy that optimizes nobody's utility. VPL recovers multi-modal distributions using latent user context, enabling user-conditional reward modeling.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can text summaries beat embeddings for personalized reward models?

PLUS trains summarizers and reward models jointly, learning that text-based preference summaries capture dimensions zero-shot summaries miss. These summaries transfer to GPT-4 for zero-shot personalization and remain interpretable to users.

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher auditing constraint dissolution in reward modeling. The question: what preference dimensions do base reward functions actually capture, and has that changed?

What a curated library found — and when (dated claims, not current truth): findings span 2024–2026, mostly post-factorization work.

• Low-dimensional factorization: ~10 orthogonal base reward functions can reconstruct user preferences as weighted linear combinations, suggesting preference space is intrinsically low-dimensional and shared (2025-03).
• Single-scalar Bradley-Terry-Luce reward models collapse multi-modal user preferences into centroid averages, rendering minority and conflicting preferences inexpressible (2024-08, 2024-12).
• Human feedback is two-channel: evaluative ("how good") and directive ("how to change"). Scalar rewards capture only evaluation, discarding directional signal (2024-08).
• Annotation responses mix genuine preferences, non-attitudes, and constructed-on-the-spot preferences. Standard reward functions train on all three as one signal, learning blurred composites (2026-01).
• Learned text-based user preference summaries condition reward models better than embeddings and capture dimensions zero-shot summaries miss while remaining human-readable (2025-07).

Anchor papers (verify; mind their dates):
- arXiv:2503.06358 (Language Model Personalization via Reward Factorization, 2025-03)
- arXiv:2504.12522 (Evaluating Diversity and Quality of LLM Generated Content, 2025-04)
- arXiv:2507.05197 (Pre-Trained Policy Discriminators are General Reward Models, 2025-07)
- arXiv:2604.03238 (Measuring Human Preferences in RLHF is a Social Science Problem, 2026-01)

Your task:
(1) RE-TEST EACH CONSTRAINT. For factorization's "~10 dimensions" claim: has this held as model scale and dataset diversity expanded since spring 2025? For the centroid-collapse problem: have newer conditioning mechanisms (text summaries, structured preference representations, post-hoc policy discriminators) genuinely avoided averaging, or do they still optimize a weighted centroid under the hood? Separate the durable insight (multimodality exists; scalar reduction loses information) from the perishable limitation (maybe solvable via conditioning or policy discrimination).
(2) Surface the strongest work in the last 6 months contradicting the "averaging is protective" frame—does personalization empirically amplify echo chambers, or have recent guardrails constrained that failure mode?
(3) Propose 2 research questions that assume the regime may have moved: (a) If text-conditioned or discriminator-based reward models now capture directional+evaluative signal, do they still require social-science decomposition of annotation types, or does end-to-end training dissolve that problem? (b) Can you verify low-dimensionality across *heterogeneous* domains (coding, safety, creative writing, reasoning), or is the 10-dimension finding domain-specific?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
