INQUIRING LINE

When AI learns preferences from 'pick the better answer,' it captures who won but not by how much — a gap that quietly compounds.

How do binary comparisons constrain reward scale in multi-user preference learning?

This explores a structural limit: when you train reward models on binary 'A beats B' choices instead of absolute scores, you lose information about *how much* better A is — and that lost magnitude becomes a real problem once you're learning from many disagreeing users at once.

Binary preference comparisons throw away magnitude. A pairwise 'A is better than B' tells you the sign of a difference but not its size, so the learned reward has an arbitrary scale. The corpus suggests that this arbitrariness, harmless with one user, compounds badly once you're serving many.
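A minimal sketch of that scale ambiguity, assuming we only look at which response won each comparison. The reward values, response names, and helper functions are hypothetical and not taken from any of the cited notes.

```python
# Hypothetical reward scores for three responses; values are illustrative only.
rewards = {"a": 0.2, "b": 0.1, "c": -0.5}

# Binary comparison labels as (winner, loser) pairs.
comparisons = [("a", "b"), ("b", "c"), ("a", "c")]


def agreement(reward, pairs):
    """Fraction of pairs where the reward ranks the winner above the loser."""
    hits = sum(1 for winner, loser in pairs if reward[winner] > reward[loser])
    return hits / len(pairs)


def rescale(reward, scale, shift):
    """Apply a positive affine map; the ordering of scores is unchanged."""
    return {key: scale * value + shift for key, value in reward.items()}


stretched = rescale(rewards, scale=100.0, shift=7.0)

original_agreement = agreement(rewards, comparisons)
stretched_agreement = agreement(stretched, comparisons)

# The gap between "a" and "b" grows by a factor of 100, yet both reward
# functions explain the binary labels equally well.
original_gap = rewards["a"] - rewards["b"]
stretched_gap = stretched["a"] - stretched["b"]

print(original_agreement, stretched_agreement)
print(original_gap, stretched_gap)
```

Both reward functions agree with every label, so the labels alone cannot say whether 'a' beats 'b' narrowly or by a wide margin.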

The sharpest version of the problem is calibration. When rewards collapse to a binary correct/incorrect signal, models are pushed toward confident guessing because nothing penalizes a confident wrong answer: the reward can't distinguish 'barely right' from 'certainly right.' The note 'Does binary reward training hurt model calibration?' shows this is provable, and that bolting on a proper scoring rule (Brier score) as a second reward term restores the lost scale by forcing the model to commit to a magnitude, not just a direction. That's the single-user symptom of a scale-free reward.
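To make the Brier-score argument concrete, here is a small sketch under stated assumptions: the model answers correctly with probability `p_correct` and reports a `confidence` in [0, 1], and the combined reward is correctness minus the squared gap between confidence and outcome. The function names and the grid search are illustrative, not the cited work's implementation.

```python
def expected_binary_reward(p_correct, confidence):
    """Expected reward when only correctness is scored; confidence is ignored."""
    return p_correct


def expected_calibrated_reward(p_correct, confidence):
    """Expected correctness minus the Brier penalty (confidence - outcome)^2."""
    penalty_if_right = (confidence - 1.0) ** 2
    penalty_if_wrong = (confidence - 0.0) ** 2
    expected_penalty = (
        p_correct * penalty_if_right + (1.0 - p_correct) * penalty_if_wrong
    )
    return p_correct - expected_penalty


def best_confidence(reward_fn, p_correct, steps=100):
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: reward_fn(p_correct, q))


p = 0.7

# With the Brier term, the best stated confidence matches the true accuracy.
honest = best_confidence(expected_calibrated_reward, p)

# With the binary reward alone, claiming certainty costs nothing.
overconfident_gain = expected_binary_reward(p, 1.0) - expected_binary_reward(p, 0.7)

print(honest, overconfident_gain)
```

Under the binary reward every confidence level earns the same expected reward, so overclaiming is free. With the Brier term the expected reward peaks where stated confidence equals the true probability of being correct.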

Now add multiple users, and the missing scale becomes a representational crisis. A single reward model trained on aggregated binary preferences literally cannot encode disagreement: a 51–49 split forces it to either always satisfy the majority or satisfy everyone half the time, because there's no axis on which 'this group feels strongly, that group is indifferent' can be expressed (see 'Can aggregate reward models satisfy genuinely disagreeing users?'). Binary comparisons flatten intensity into a vote, and votes have no scale. This is why the statistics of preference data don't behave like ordinary data: 'Does preference data need more raters than examples?' shows the error bounds depend on the *number of raters*, not just the number of examples, precisely because each rater carries a different latent scale you can't recover from comparisons alone.
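The 51–49 arithmetic can be sketched as follows, assuming two groups with opposite preferences between answers A and B and a single model that serves A with some fixed probability. The populations and intensity values are hypothetical.

```python
def satisfaction(prob_a, share_prefers_a=0.51):
    """Per-group and overall satisfaction when the model serves A with prob_a."""
    group_a = prob_a          # users who prefer A are happy when A is served
    group_b = 1.0 - prob_a    # users who prefer B are happy when B is served
    overall = share_prefers_a * group_a + (1.0 - share_prefers_a) * group_b
    return group_a, group_b, overall


# Option 1: always serve the majority answer.
always_majority = satisfaction(prob_a=1.0)

# Option 2: flip a fair coin between the two answers.
coin_flip = satisfaction(prob_a=0.5)


def votes(intensities):
    """Reduce signed preference strengths (positive means A) to binary votes."""
    return ["A" if strength > 0 else "B" for strength in intensities]


# Hypothetical populations: same votes, very different strengths of feeling.
mild_majority = [0.1, 0.1, -5.0]
strong_majority = [5.0, 5.0, -0.1]
same_votes = votes(mild_majority) == votes(strong_majority)

print(always_majority, coin_flip, same_votes)
```

Serving the majority leaves the minority group at zero satisfaction, while the coin flip leaves both groups at one half. The two populations at the end yield identical vote lists, so a model that sees only votes cannot tell them apart.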

The corpus also suggests the binary signal is noisier than it looks at the source. Annotation clicks aren't all the same thing: they decompose into genuine preferences, non-attitudes, and on-the-spot constructed preferences, and treating them uniformly contaminates the reward (see 'Do all annotation responses measure the same underlying thing?'). A binary comparison can't tell a deeply held judgment from a coin-flip, which is exactly the scale information multi-user learning needs most.

The interesting move is the corpus's escape routes, which all work by *not* relying on absolute binary labels. POLAR reframes reward modeling as measuring distance from a target policy, eliminating absolute preference labels entirely and getting a continuous scale for free (see 'Can reward models learn by comparing policies instead of judging them?'). Reward factorization keeps things personal by representing each user as a linear combination of shared base rewards, inferable from about ten adaptive questions, which recovers per-user scale without per-user retraining (see 'Can user preferences be learned from just ten questions?'). And natural-language critiques carry the magnitude and the *why* that a numerical reward, let alone a binary one, structurally cannot, which is how they break plateaus where scalar rewards stall (see 'Can natural language feedback overcome numerical reward plateaus?'). The throughline: binary comparisons constrain reward scale by design, and the fix is always to restore a richer signal, whether a second scoring term, a distance, a per-user basis, or words.
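Of these routes, the per-user linear combination is the easiest to illustrate. The sketch below shows only the representation, not the adaptive question selection or weight inference described in the notes. The two base rewards (brevity and detail), their scores, and the user weights are all hypothetical.

```python
# Hypothetical shared base rewards per response: (brevity, detail).
base_rewards = {
    "short_answer": (0.9, 0.2),
    "long_answer": (0.1, 0.8),
}

# Hypothetical per-user weights over the same two base rewards.
users = {
    "terse_user": (1.0, 0.2),
    "thorough_user": (0.2, 1.0),
}


def user_reward(weights, response):
    """Per-user reward as a linear combination of the shared base rewards."""
    return sum(w * b for w, b in zip(weights, base_rewards[response]))


def preferred(weights):
    """Response with the highest reward under this user's weights."""
    return max(base_rewards, key=lambda response: user_reward(weights, response))


def margin(weights):
    """How much better the short answer is for this user (negative means worse)."""
    return user_reward(weights, "short_answer") - user_reward(weights, "long_answer")


for name, weights in users.items():
    print(name, preferred(weights), round(margin(weights), 2))
```

The base rewards are shared across users and only the small weight vector differs, so the same two responses receive opposite rankings along with a signed margin that a single aggregate vote would not carry.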

Sources (7 notes)

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can aggregate reward models satisfy genuinely disagreeing users?

Single reward models trained on aggregated preferences cannot represent disagreement. A 51-49 preference split forces a choice between leaving 49% unhappy always or leaving everyone unhappy half the time. This is a representational failure, not a quality problem.

Does preference data need more raters than examples?

Preference data is not i.i.d. across raters with different preferences. PAC bounds for personalized reward models decompose into terms depending on both examples per rater and number of raters, showing rater diversity matters as much as data volume.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Capturing Individual Human Preferences with Reward Features (4.12 match)
Measuring Human Preferences in RLHF is a Social Science Problem (2.56 match)
Reward Reasoning Model (2.51 match)
Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models (2.40 match)
RM-R1: Reward Modeling as Reasoning (1.67 match)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (1.65 match)
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning (1.65 match)
Personalized Language Modeling from Personalized Human Feedback (1.62 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a preference learning researcher. The question: **do binary comparisons irreversibly constrain reward scale in multi-user settings, or have recent model advances, training methods, or evaluation harnesses relaxed this constraint?**

What a curated library found — and when (dated claims, not current truth):
Findings span 2020–2026; treat these as perishable until you re-test them:
- Binary preferences collapse magnitude to binary signal, forcing confident guessing and breaking calibration; adding a proper scoring rule (Brier score) recovers scale (~2024).
- Aggregated binary reward models **cannot** encode disagreement: a 51–49 user split forces majority-satisfaction or 50–50 compromise, with no axis for intensity (~2024).
- Error bounds in personalized reward learning depend on *number of raters*, not examples alone, because per-rater latent scale is unrecoverable from comparisons (~2024).
- Annotation clicks decompose into genuine preferences, non-attitudes, and constructed preferences; treating uniformly contaminates the signal (~2024).
- Escape routes all restore richer signals: policy discriminators (distance-based rewards), reward factorization (per-user linear combinations), and natural-language critiques (~2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2408.16984 (2024-08) *Beyond Preferences in AI Alignment* — calibration and scale-freedom critique.
- arXiv:2503.06358 (2025-03) *Language Model Personalization via Reward Factorization* — per-user basis recovery.
- arXiv:2507.05197 (2025-07) *Pre-Trained Policy Discriminators are General Reward Models* — policy distance as scale.
- arXiv:2604.03238 (2026-01) *Measuring Human Preferences in RLHF is a Social Science Problem* — annotation decomposition.

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every finding above, judge whether newer models (o1, reasoning-chain inference), training methods (DPO variants, outcome-based RL), tooling (reward caching, multi-agent orchestration), or evaluation (human consistency metrics) have **relaxed or overturned** it since mid-2025. Separate the durable question (e.g., *can binary labels encode user intensity?*) from the perishable limitation (e.g., *scaling proper scoring rules is intractable*). Cite what resolved it.
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Has any paper shown binary comparisons **suffice** for multi-user RL with a clever aggregation or inference trick? Has reward factorization or language feedback hit its own limits?
(3) **Propose 2 research questions** that **assume the regime may have moved**: e.g., *if language critique models now scale to 10k+ users, does the binary-scale constraint even matter?* or *do reasoning-chain LLMs reconstruct user intensity from indirect binary signals?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
