INQUIRING LINE

When you train one AI on what most people prefer, the minority isn't just outvoted — it gets erased by design.

How do aggregate reward models systematically exclude minority perspectives?

This explores why training a single reward model on pooled human preferences doesn't just *underweight* minority views — it structurally erases them, and what the corpus offers as alternatives.

The cleanest way to see the problem is a thought experiment from the corpus: imagine users split 51-49 on what makes a good answer. A single aggregate reward model has to pick one winner, so it either leaves the 49% unhappy every single time, or it splits the difference and leaves *everyone* unhappy half the time (see "Can aggregate reward models satisfy genuinely disagreeing users?"). There's no setting of one model that satisfies people who genuinely disagree. This is a representational failure, not a tuning bug: averaging over disagreement doesn't find a consensus, it manufactures a fictional median user no one actually is.
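The arithmetic behind the 51-49 thought experiment can be sketched in a few lines. This is a toy model, not anything from the corpus: the function name `satisfaction`, the two answer styles "A" and "B", and the rule that a user is satisfied only when the answer matches their group's preference are all assumptions made for illustration.

```python
# Toy model of the 51-49 split (all names and numbers are illustrative).
# Two groups prefer different answer styles, "A" and "B". A user counts as
# satisfied only when the answer matches their own group's preference.

def satisfaction(p_answer_a: float, share_a: float = 0.51) -> dict:
    """Per-group and population-average satisfaction for a single policy.

    p_answer_a: probability that the one shared model answers in style A.
    share_a: fraction of users who prefer style A.
    """
    sat_a = p_answer_a          # group A is satisfied when it gets style A
    sat_b = 1.0 - p_answer_a    # group B is satisfied when it gets style B
    average = share_a * sat_a + (1.0 - share_a) * sat_b
    return {"group_a": sat_a, "group_b": sat_b, "average": average}


majority_wins = satisfaction(1.0)     # always answer in the majority style
split_difference = satisfaction(0.5)  # answer in each style half the time

print(majority_wins)
print(split_difference)
```

Under these assumptions the population average works out to 0.49 + 0.02 × `p_answer_a`, so an objective that maximizes the average pushes `p_answer_a` to 1 and drives the minority group's satisfaction to zero. The split policy gives both groups 0.5 and a lower average, which is why an average-maximizing objective would not choose it.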

That intuition has been proven formally. MaxMin-RLHF shows that fitting one reward model to aggregated preferences provably cannot represent diverse populations equitably: the math guarantees minority viewpoints get silently absorbed into the majority signal (see "Can a single reward model represent diverse human preferences?"). The proposed escape borrows from social choice theory: learn a *mixture* of preference distributions rather than one blended reward, then optimize a MaxMin objective that explicitly protects the worst-off group instead of maximizing the average. The exclusion isn't an accident of bad data. It's baked into the act of collapsing many preferences into one scalar.
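The difference between maximizing the average and protecting the worst-off group shows up even when choosing among a handful of fixed options. The sketch below is not the MaxMin-RLHF algorithm, which learns the preference mixture and trains a policy. It only contrasts the two selection rules, and the policy names, per-group rewards, and group shares are hypothetical numbers.

```python
# Hypothetical expected rewards per group for three candidate policies.
# Each value is (majority-group reward, minority-group reward).
candidates = {
    "majority_tuned": (0.9, 0.2),
    "minority_tuned": (0.3, 0.9),
    "balanced": (0.7, 0.6),
}
group_shares = (0.8, 0.2)  # assumed population shares of the two groups


def average_reward(rewards, shares):
    """Population-weighted mean reward: what a pooled objective optimizes."""
    return sum(r * s for r, s in zip(rewards, shares))


def worst_group_reward(rewards):
    """Reward of the worst-off group: what a MaxMin objective optimizes."""
    return min(rewards)


best_by_average = max(
    candidates, key=lambda name: average_reward(candidates[name], group_shares)
)
best_by_maxmin = max(
    candidates, key=lambda name: worst_group_reward(candidates[name])
)

print("average picks:", best_by_average)
print("maxmin picks:", best_by_maxmin)
```

With these numbers the average rule selects `majority_tuned`, leaving the minority at 0.2, while the MaxMin rule selects `balanced`, because its worst-off group still gets 0.6.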

Here's the part that should give you pause: the obvious fix, personalizing the reward model per user so no one gets averaged away, backfires. Strip out the aggregate's averaging effect and you remove the thing that was dampening sycophancy. Systems then learn to flatter each user and reinforce their existing views, recreating the polarization dynamics of recommender feeds (see "Does personalizing reward models amplify user echo chambers?"). So aggregation excludes minorities, but naive personalization manufactures echo chambers. The real design space lives between those two failure modes, not at either pole.

The same averaging pathology shows up far outside RLHF, which is the tell that this is structural. Accuracy-optimized recommenders systematically crowd out minority interests by over-weighting whatever dominates a user's history. Notably, the fix there isn't retraining but *post-hoc reranking* that enforces proportional representation as a calibration constraint (see "Why do accuracy-optimized recommenders crowd out minority interests?"). Ranking systems show the mechanism that makes it worse over time: without explicitly modeling selection bias, models converge on degenerate equilibria that amplify their own past decisions in a feedback loop (see "Why do ranking systems need to model selection bias explicitly?"). Minority exclusion isn't static: each training round trains on data shaped by the last round's majority bias.
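One way to picture post-hoc reranking with a calibration constraint is a greedy pass over an already-scored candidate list. The sketch below is an illustration under stated assumptions, not the method of any cited system: the KL-divergence penalty, the smoothing constant `alpha`, the trade-off weight `lam`, the one-genre-per-item simplification, and every item and score are choices made here.

```python
import math


def kl_divergence(target, actual, alpha=0.01):
    """KL(target || actual); `actual` is smoothed toward `target` to avoid log(0)."""
    total = 0.0
    for genre, p in target.items():
        if p == 0.0:
            continue
        q = (1.0 - alpha) * actual.get(genre, 0.0) + alpha * p
        total += p * math.log(p / q)
    return total


def genre_distribution(items):
    """Share of each genre in a list of (item_id, genre, score) tuples."""
    counts = {}
    for _, genre, _ in items:
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: count / len(items) for genre, count in counts.items()}


def calibrated_rerank(candidates, target, k, lam=0.5):
    """Greedily build a top-k list, trading relevance against miscalibration."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def objective(item):
            trial = selected + [item]
            relevance = sum(score for _, _, score in trial)
            penalty = kl_divergence(target, genre_distribution(trial))
            return (1.0 - lam) * relevance - lam * penalty

        best = max(remaining, key=objective)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical user whose history is 70% action and 30% documentary.
target = {"action": 0.7, "documentary": 0.3}
candidates = [(f"a{i}", "action", 0.90 - 0.01 * i) for i in range(10)]
candidates += [(f"d{i}", "documentary", 0.60 - 0.01 * i) for i in range(4)]

by_score = sorted(candidates, key=lambda item: item[2], reverse=True)[:10]
reranked = calibrated_rerank(candidates, target, k=10)

print([item_id for item_id, _, _ in by_score])
print([item_id for item_id, _, _ in reranked])
```

In this setup the score-only top 10 contains no documentaries, because every action item outscores them, while the reranked list brings the minority interest back in without touching the underlying scores. How close the final mix lands to 70/30 depends on `lam` and `alpha`.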

One deeper thread worth pulling: part of why a single reward model has to choose a winner is that it compresses everything into one number. Feedback actually carries two separable signals, *evaluative* (how good was this) and *directive* (how should it change), and a scalar reward keeps the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). A richer feedback representation that preserves the directional information might let a system hold multiple legitimate preferences open rather than forcing them into a single ranking, which suggests the exclusion problem is partly downstream of how thin the scalar-reward bottleneck is.

Sources 6 notes

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Can a single reward model represent diverse human preferences?

MaxMin-RLHF proves an impossibility result: fitting one reward model to aggregated preferences silently erases minority viewpoints. The solution is learning a mixture of preference distributions and optimizing a MaxMin objective from social choice theory to protect the worst-off groups.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Why do accuracy-optimized recommenders crowd out minority interests?

Accuracy-optimized models systematically miscalibrate by over-weighting dominant user interests. A post-processing reranking algorithm that enforces calibration constraints can restore proportional representation without retraining the underlying model.

Why do ranking systems need to model selection bias explicitly?

YouTube's multi-objective ranker uses MMoE for conflicting objectives and a shallow position tower to remove selection bias from training data. Without both mechanisms, models converge on degenerate equilibria that amplify their own past decisions.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Capturing Individual Human Preferences with Reward Features (2.57 match)
Measuring Human Preferences in RLHF is a Social Science Problem (2.55 match)
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models (2.39 match)
Calibrated Recommendations (1.66 match)
Beyond Preferences in AI Alignment (1.66 match)
Self-Improving Model Steering (1.65 match)
Information-Theoretic Reward Decomposition for Generalizable RLHF (1.59 match)
Foundations of Large Language Models (1.57 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining minority representation in LLM reward systems. The question: How do aggregate reward models structurally exclude minority perspectives—and has this constraint been relaxed or overturned since 2023–26?

What a curated library found—and when (dated claims, not current truth):
Findings span 2023–26 and include:
• Single aggregate reward models provably cannot represent diverse populations equitably; collapsing disagreement into one scalar manufactures a fictional median user (MaxMin-RLHF, 2024-02).
• Naive personalization (per-user reward models) backfires by amplifying sycophancy and echo chambers, recreating recommender-feed polarization (2024–25 work).
• Accuracy-optimized recommenders systematically crowd out minority interests via over-weighting dominant history; the fix is post-hoc reranking enforcing proportional representation, not retraining (2023–24).
• Ranking systems converge on degenerate equilibria that amplify past majority bias in feedback loops unless selection bias is explicitly modeled (2024–25).
• Scalar reward bottleneck discards directional (change) signal, forcing preferences into a single ranking rather than holding multiple legitimate views open (inference from 2024–25 work).

Anchor papers (verify; mind their dates):
• arXiv:2402.08925 (MaxMin-RLHF, 2024-02)
• arXiv:2307.15142 (Accuracy-diversity trade-off, 2023-07)
• arXiv:2503.17338 (Reward Features, 2025-03)
• arXiv:2604.03238 (Human Preferences as Social Science, 2026-01)

Your task:
(1) RE-TEST EACH CONSTRAINT. For the aggregation dilemma (51–49 split forcing either systematic exclusion or compromise unhappiness), check whether test-time methods (e.g., TTRL, 2025-04), reward reasoning (2025-05), or token-level reasoning (DRO, 2026-02) have created escape hatches—ways to defer or pluralize the decision boundary past training. Separately: has richer feedback representation (e.g., reward features, 2025-03) or explicit credit assignment (2026-02) dissolved the scalar bottleneck, letting systems hold minority signals live through training? State plainly where the core representational failure still holds.

(2) Surface the strongest work from late 2025–26 that contradicts or supersedes the "no fix except staying between aggregation and personalization" framing.

(3) Propose 2 research questions that assume the regime may have moved: (a) Can test-time reasoning over multi-modal preference signals preserve minority views without retraining? (b) Does reasoning-based reward modeling (RM-R1, R1 reasoning) decouple preference aggregation from preference *ranking*, allowing a system to honor disagreement structurally?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
