Can reward models learn by comparing policies instead of judging them?
What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
Traditional reward modeling presupposes an absolute preference: humans rank responses, the RM learns "good" vs "bad" in that absolute frame, the policy optimizes against that signal. The reliance on manually-defined preferences is exactly what limits scale — every new task domain demands new preference data.
POLAR (arXiv:2507.05197) redefines what an RM is. Instead of treating the RM as an absolute preference predictor, it treats the RM as a policy discriminator: given a candidate policy and a target policy, quantify the difference. Higher scores go to policies more similar to the target. The reward signal guides the training policy toward desired behaviors without ever encoding what those behaviors should be in absolute terms.
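To make the interface concrete, here is a minimal sketch of "score a candidate policy by its closeness to a target policy". Everything in it is an illustrative assumption rather than POLAR's implementation: the `Policy` type, the three toy policies, and especially `pair_score`, which uses plain string similarity as a stand-in for a learned discriminator.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence

Policy = Callable[[str], str]  # hypothetical: maps a prompt to a response


def pair_score(reference: str, candidate: str) -> float:
    """Toy stand-in for a learned discriminator: string similarity in [0, 1]."""
    return SequenceMatcher(None, reference, candidate).ratio()


def policy_similarity(candidate: Policy, target: Policy, prompts: Sequence[str]) -> float:
    """Average per-prompt similarity between candidate and target outputs."""
    scores = [pair_score(target(p), candidate(p)) for p in prompts]
    return sum(scores) / len(scores)


# Three hypothetical deterministic policies, for illustration only.
def target(prompt: str) -> str:
    return f"Answer: {prompt.upper()}"


def near(prompt: str) -> str:
    return f"Answer: {prompt.upper()}!"


def far(prompt: str) -> str:
    return "I cannot help with that."


prompts = ["two plus two", "capital of france"]
near_score = policy_similarity(near, target, prompts)
far_score = policy_similarity(far, target, prompts)
```

Note that nothing in the scorer says what a good answer looks like; swapping in a different `target` changes what gets rewarded without touching the scoring code.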
The shift is consequential. Since target policies can be arbitrarily chosen, the objective becomes criterion-agnostic — it applies to any scenario where you can describe the desired policy by demonstration rather than by attribute. This eliminates the bottleneck of preference annotation and creates a scalable pre-training paradigm for RMs. Train once on policy discrimination; reuse across many task formulations by varying the target.
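One way to picture "training on policy discrimination" is as a ranking problem over triples: a reference response, a positive drawn from the same policy, and a negative drawn from a different policy. The sketch below assumes a Bradley-Terry style ranking loss, which this note does not specify; `make_triples`, `bt_loss`, and the sample data are hypothetical names introduced for illustration.

```python
import math
from itertools import permutations
from typing import Dict, Iterator, List, Tuple


def make_triples(samples: Dict[str, List[str]]) -> Iterator[Tuple[str, str, str]]:
    """Yield (reference, positive, negative) triples.

    `samples` maps a policy name to responses it produced for ONE prompt.
    Reference and positive come from the same policy; negative from another.
    """
    for pos_policy, neg_policy in permutations(samples, 2):
        for ref, pos in permutations(samples[pos_policy], 2):
            for neg in samples[neg_policy]:
                yield ref, pos, neg


def bt_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry ranking loss: -log(sigmoid(score_pos - score_neg))."""
    margin = score_pos - score_neg
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical samples: policy_a answered twice, policy_b once.
samples = {"policy_a": ["a1", "a2"], "policy_b": ["b1"]}
triples = list(make_triples(samples))
```

The labels here come for free from knowing which policy produced which response, which is why no human preference annotation enters the loop. A policy with only one sample (like `policy_b` above) can serve as a negative but cannot supply a reference/positive pair.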
The empirical claim is strong: POLAR RMs at 1.8B to 7B parameters substantially outperform traditional RMs trained without this pre-training stage. The relative framing makes the RM transferable in a way absolute-preference RMs are not.
The deeper move is conceptual. A reward model is not a value judgment; it is a similarity measure to a chosen reference. This connects to the note "Can models learn what makes research worth doing?": both treat reward as a relational construct (similarity to a reference, ranking within a community) rather than an absolute property. The dominant RLHF paradigm trained RMs to encode "what humans want"; POLAR trains them to encode "how close are you to this." The latter scales because it admits any reference policy as target.
A concrete consequence: POLAR fits naturally into the verifier-free RL pattern emerging in late-2025 work. When the target policy is given (e.g., as a demonstration set), no manual preference labels are needed. This is the same move RARO makes via adversarial IRL — both reject the labeled-preference bottleneck — but POLAR's relative framing is general-purpose where RARO is adversarial.
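As a rough sketch of how a demonstration set can stand in for preference labels during RL, the reward for a rollout can be computed against the demonstration for the same prompt. The names `make_reward_fn`, `placeholder_scorer`, and the sample demonstration are hypothetical; the scorer is a string-similarity placeholder where a trained policy-discriminator RM would go.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict

# (prompt, reference, candidate) -> score
Scorer = Callable[[str, str, str], float]


def placeholder_scorer(prompt: str, reference: str, candidate: str) -> float:
    """Hypothetical stand-in for a trained policy-discriminator RM."""
    return SequenceMatcher(None, reference, candidate).ratio()


def make_reward_fn(demos: Dict[str, str], scorer: Scorer) -> Callable[[str, str], float]:
    """Build reward(prompt, rollout) from demonstrations only, no preference labels."""
    def reward(prompt: str, rollout: str) -> float:
        if prompt not in demos:
            raise KeyError(f"no demonstration for prompt: {prompt!r}")
        return scorer(prompt, demos[prompt], rollout)
    return reward


# Hypothetical demonstration set with a single prompt.
prompt = "Translate 'bonjour' to English."
demos = {prompt: "hello"}
reward = make_reward_fn(demos, placeholder_scorer)

rollouts = ["hello", "hullo", "goodbye"]
rewards = [reward(prompt, r) for r in rollouts]
best = max(zip(rewards, rollouts))[1]  # rollout with the highest reward
```

The only supervision consumed is the `demos` mapping; changing the target behavior means swapping demonstrations, not relabeling preference pairs.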
Inquiring lines that use this note as a source (13)
This note is a source for these synthesized inquiries.
- How should preference channels from historical sessions inform unified policy learning?
- Why do reward models learn surface-level shortcuts instead of genuine quality assessment?
- How does modularity in reward and policy design enable goal generalization?
- What preference dimensions do base reward functions typically capture?
- How do reward models as policy discriminators differ from labeled preferences?
- How do adversarial IRL and policy discrimination differ in rejecting preference labels?
- How do relational reward signals compare to absolute preference encodings in RL?
- What makes policy discrimination scalable where preference annotation hits bottlenecks?
- Do personalized reward models work better than one-size-fits-all approaches?
- What are the actual limits of sibling comparison versus trained process reward models?
- Can variational inference recover user-specific reward models from preference comparisons?
- What makes reward models fundamentally different from policy discriminators?
- How do binary comparisons constrain reward scale in multi-user preference learning?
Related concepts in this collection (3)
- Can models learn what makes research worth doing?
  Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste' (the judgment of what work matters) is a learnable skill separate from execution.
  Relation: both reframe reward as relational: similarity-to-target (POLAR) vs ranking-within-community (RLCF); neither requires absolute preference labels.
- Can adversarial critics replace task-specific verifiers for reasoning?
  Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
  Relation: RARO uses adversarial discrimination against demonstrations; POLAR uses similarity to a target policy; same anti-labeled-preference move, different mechanism.
- Can generative reasoning beat discriminative models with less training data?
  Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.
  Relation: generative PRMs add reasoning before judging; POLAR adds relative framing. These are orthogonal axes of RM improvement.
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Pre-Trained Policy Discriminators are General Reward Models
- Reward Reasoning Model
- Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
- Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards
- RM-R1: Reward Modeling as Reasoning
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning
- Capturing Individual Human Preferences with Reward Features
Original note title
reward models redefined as policy discriminators measure distance from a target policy — criterion-agnostic and scalable