INQUIRING LINE

How do reward models as policy discriminators differ from labeled preferences?

This explores POLAR's reframing of reward modeling — scoring how close a policy sits to a target rather than judging answers against human preference labels — and what that shift buys you over the standard labeled-preference approach.


Reward models can be rebuilt as *policy discriminators*, systems that measure distance from a target policy, instead of following the usual recipe of training on human-labeled preference pairs. The cleanest statement of the idea is POLAR (see 'Can reward models learn by comparing policies instead of judging them?'), which flips the question. A conventional reward model learns absolute judgments: 'this answer is better than that one,' grounded in human labels. POLAR instead asks 'how similar is this policy to a chosen target?' and assigns higher scores to closer policies. That sidesteps absolute preference labels entirely, and the payoff is transfer: pre-trained POLAR reward models in the 1.8B–7B parameter range substantially outperform non-pre-trained methods and carry across task formulations rather than being welded to one labeling scheme.
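To make the contrast concrete, here is a minimal sketch of the two interfaces. The first function is the standard Bradley-Terry pairwise loss commonly used with labeled preference pairs. The second is only a hypothetical stand-in for a learned discriminator: it scores a candidate by token overlap with a sample from the target policy. The names `toy_target_similarity` and `rank_by_target` are assumptions made for illustration, and nothing here reproduces POLAR's actual model or training objective.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss for labeled preferences: -log(sigmoid(chosen - rejected))."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))


def toy_target_similarity(candidate: str, target_sample: str) -> float:
    """Hypothetical stand-in for a discriminator: Jaccard overlap of tokens.

    A real policy discriminator would be a learned model; this toy only
    shows the interface (candidate + target sample -> score).
    """
    cand_tokens = set(candidate.lower().split())
    target_tokens = set(target_sample.lower().split())
    if not cand_tokens and not target_tokens:
        return 1.0
    return len(cand_tokens & target_tokens) / len(cand_tokens | target_tokens)


def rank_by_target(candidates: list[str], target_sample: str) -> list[str]:
    """Order candidates from closest to farthest from the target sample."""
    return sorted(
        candidates,
        key=lambda c: toy_target_similarity(c, target_sample),
        reverse=True,
    )


if __name__ == "__main__":
    # Labeled-preference view: needs a human-chosen winner for each pair.
    print(bradley_terry_loss(score_chosen=2.0, score_rejected=0.0))

    # Discriminator view: needs only a sample from the target policy.
    target = "the capital of france is paris"
    print(rank_by_target(["i like turtles", "paris is the capital of france"], target))
```

The point of the sketch is the difference in inputs: the pairwise loss cannot be computed without a label saying which answer won, while the similarity score needs only a reference drawn from whatever policy you chose as the target.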

Why would you want to escape labels at all? Because labels are noisier than they look. One striking finding is that annotation responses aren't a single signal: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, distinguishable only by whether they stay consistent across measurement conditions (see 'Do all annotation responses measure the same underlying thing?'). Treat them all as ground truth and you contaminate the reward model. A discriminator that measures distance-from-target inherits the quality of its target policy instead of the quality of a crowd's momentary clicks, which is a different, and often cleaner, failure surface.

The deeper theme across the corpus is that the *scalar preference label* is a lossy container. Agent feedback, for instance, carries two orthogonal things at once: evaluative information (how good was this?) and directive information (how should it change?), and a single scalar reward can only hold the first (see 'Can scalar rewards capture all the information in agent feedback?'). Other notes route around labels in their own ways. Test-time RL manufactures reward from majority vote across repeated samples, with no ground-truth labels at all (see 'Can models improve themselves using only majority voting?'). ΔBelief-RL turns the agent's own shifting confidence toward a solution into dense per-turn credit, with no preference annotations or critic networks required (see 'Can an agent's own beliefs guide credit assignment without critics?'). Discrimination-from-target is one member of this larger family: ways to get a training signal without asking humans to rank outputs.
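Both label-free signals are simple enough to sketch. The snippet below is a minimal illustration under stated assumptions, not the papers' implementations: `majority_vote_rewards` gives reward 1.0 to samples that agree with the most common answer, and `belief_delta_credit` assigns each step the log-ratio of consecutive probability estimates. Both function names are hypothetical, and the exact reward shaping used in Test-Time RL or ΔBelief-RL may differ.

```python
import math
from collections import Counter


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Reward 1.0 for samples matching the most common answer, else 0.0.

    No ground-truth label is consulted; the consensus answer acts as a
    pseudo-label (an assumption of this sketch).
    """
    if not answers:
        return []
    majority_answer, _count = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority_answer else 0.0 for answer in answers]


def belief_delta_credit(beliefs: list[float]) -> list[float]:
    """Per-step credit as log(p_t / p_{t-1}) over a sequence of beliefs.

    `beliefs[t]` is the agent's estimated probability of the correct
    solution after step t; every value must be strictly positive.
    """
    if any(p <= 0.0 for p in beliefs):
        raise ValueError("beliefs must be strictly positive probabilities")
    return [math.log(cur / prev) for prev, cur in zip(beliefs, beliefs[1:])]


if __name__ == "__main__":
    # Four sampled answers to one question; two agree on "42".
    print(majority_vote_rewards(["42", "41", "42", "7"]))

    # Belief in the right answer rises, dips, then rises again.
    print(belief_delta_credit([0.1, 0.2, 0.1, 0.8]))
```

One property worth noticing in the second function: the per-step credits telescope, so their sum equals the log-ratio of the final belief to the initial one. Steps that raise belief get positive credit, steps that lower it get negative credit, and no separate critic is needed to say which is which.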

There's also a personalization angle worth knowing. The averaging that aggregate preference labels perform is quietly protective: it washes out individual quirks. Specialize a reward model per user and you remove that buffer, and the system can learn sycophancy and harden echo chambers, exactly the way recommender systems do (see 'Does personalizing reward models amplify user echo chambers?'). A target-policy discriminator reframes 'whose preferences' as 'which target,' which makes that choice explicit rather than hidden inside a label distribution.

If you want to go further, two notes complicate the picture in useful directions. Reward models that *reason* before scoring raise their own capability ceiling beyond outcome-only evaluation (see 'Can reward models benefit from reasoning before scoring?'). And binary correctness rewards, the simplest labels of all, incentivize confident guessing and so degrade calibration unless you add a proper scoring rule such as the Brier score as a second reward term (see 'Does binary reward training hurt model calibration?'). Together they suggest the interesting frontier isn't 'discriminator vs. labels' but how much structure you let the reward signal carry.
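The calibration point can be checked with a few lines of arithmetic. The sketch below assumes the combined reward takes the form correctness minus the squared gap between stated confidence and correctness (a Brier penalty); the cited work's exact formulation may differ, and the function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores how confident the model was."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty on the stated confidence.

    Assumed form for this sketch: y - (confidence - y)^2, with y in {0, 1}.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (
        p_correct * calibrated_reward(True, confidence)
        + (1.0 - p_correct) * calibrated_reward(False, confidence)
    )


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_calibrated_reward(p_correct, q))


if __name__ == "__main__":
    # A wrong answer costs the same under binary reward at any confidence,
    # but costs more under the Brier term when stated with high confidence.
    print(binary_reward(False), calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))
    print(best_confidence(0.7))
```

Under these assumptions the expected reward is maximized when stated confidence equals the true probability of being correct, which is what makes the Brier term a proper scoring rule. The plain binary reward has no confidence argument at all, so it gives the model no reason to report anything but certainty.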


Sources (8 notes)

Can reward models learn by comparing policies instead of judging them?

POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a critical analyst of reward modeling and preference learning in LLMs. The question remains open: how do policy discriminators (systems measuring distance from a target) differ fundamentally from labeled-preference reward models, and does one regime subsume the other?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026, reflecting rapid evolution:

• Conventional reward models train on human-labeled preference pairs; policy discriminators instead measure distance-from-target and show transfer across task formulations in 1.8B–7B parameter ranges (2025-07, arXiv:2507.05197).
• Annotation responses decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences — treating all as ground truth contaminates the reward signal (no single year, embedded in corpus).
• Agent feedback carries two orthogonal signals (evaluative: 'how good?' and directive: 'how to change?') that a scalar reward cannot cleanly separate (2025-07, arXiv:2507.14843).
• Test-time RL manufactures reward from majority vote across unlabeled samples, bypassing preference annotation entirely (2025-04, arXiv:2504.16084).
• Personalized reward models risk amplifying sycophancy and echo chambers by removing the protective averaging that aggregate labels provide (2026-01, arXiv:2604.03238).

Anchor papers (verify; mind their dates):
• arXiv:2507.05197 (2025-07): Pre-Trained Policy Discriminators are General Reward Models
• arXiv:2604.03238 (2026-01): Measuring Human Preferences in RLHF is a Social Science Problem
• arXiv:2505.14674 (2025-05): Reward Reasoning Model
• arXiv:2504.16084 (2025-04): TTRL: Test-Time Reinforcement Learning

Your task:
(1) RE-TEST THE TRANSFER CLAIM. The 2025-07 finding that pretrained discriminators beat non-pretrained baselines across formulations — has this held as model scale increased (post-2025-07)? Does the constraint still bind when target policies themselves are noisy or misaligned? Separately: is the 'distance-from-target' regime truly orthogonal to label quality, or does it merely defer the preference problem to target selection?
(2) Surface the strongest work from the last ~6 months that contradicts or supersedes the discriminator-vs.-labels framing. Pay special attention to papers that *integrate* reasoning into rewards (2025-05 RM-R1, Reward Reasoning Model) — do they collapse the distinction or sharpen it?
(3) Propose 2 research questions that assume the regime has shifted: (a) If reward reasoning + proper scoring rules now solve calibration (as 2025-05 hints), can you build a discriminator that reasons about target-policy distance? (b) Does the personalization risk (2026-01) actually flip when you switch from labeled aggregates to explicit targets — and if so, how do you audit target-policy bias?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
