SYNTHESIS NOTE

Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

Traditional reward modeling presupposes an absolute preference: humans rank responses, the RM learns "good" vs "bad" in that absolute frame, and the policy optimizes against that signal. This reliance on manually defined preferences is what limits scale: every new task domain demands new preference data.

POLAR (2507.05197) redefines what an RM is. Instead of an absolute preference predictor, treat the RM as a policy discriminator: given a candidate policy and a target policy, quantify the difference. Higher scores go to policies more similar to the target. The reward signal guides the training policy toward desired behaviors without ever encoding what those behaviors should be in absolute terms.
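A minimal sketch of the interface this implies, assuming a toy stand-in for the learned discriminator. The function `policy_similarity_reward` and its token-overlap (Jaccard) scoring are hypothetical illustrations, not POLAR's model. The point is only the signature: the reward takes a candidate and a sample of target-policy behavior, and scores closeness rather than absolute quality.

```python
def policy_similarity_reward(candidate: str, target_reference: str) -> float:
    """Toy relative reward: Jaccard overlap between candidate and target tokens.

    Hypothetical stand-in for a learned policy discriminator. A real RM would
    be a trained network; only the (candidate, target) signature matters here.
    """
    cand_tokens = set(candidate.lower().split())
    ref_tokens = set(target_reference.lower().split())
    if not cand_tokens and not ref_tokens:
        return 1.0  # two empty outputs are treated as identical
    return len(cand_tokens & ref_tokens) / len(cand_tokens | ref_tokens)


# Output sampled from the target policy (assumed given, e.g. a demonstration).
target = "the capital of france is paris"

close = policy_similarity_reward("paris is the capital of france", target)
far = policy_similarity_reward("i do not know", target)
```

Swapping `target` for a different reference changes what gets rewarded without touching the scorer itself.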

The shift is consequential. Since target policies can be arbitrarily chosen, the objective becomes criterion-agnostic — it applies to any scenario where you can describe the desired policy by demonstration rather than by attribute. This eliminates the bottleneck of preference annotation and creates a scalable pre-training paradigm for RMs. Train once on policy discrimination; reuse across many task formulations by varying the target.
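One way to picture "training on policy discrimination" is a pairwise ranking loss. The sketch below is an assumption, since this note does not specify POLAR's actual objective: relative to a reference trajectory, a trajectory from the same policy should score higher than a trajectory from a different policy. The function name and the example scores are hypothetical.

```python
import math


def pairwise_discrimination_loss(score_same: float, score_other: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_same - score_other)).

    score_same:  RM score for a trajectory from the same policy as the reference.
    score_other: RM score for a trajectory from a different policy.
    The loss is small when the same-policy trajectory is scored higher.
    """
    margin = score_same - score_other
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch keeps exp() from overflowing.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


well_ranked = pairwise_discrimination_loss(score_same=2.0, score_other=-1.0)
mis_ranked = pairwise_discrimination_loss(score_same=-1.0, score_other=2.0)
tied = pairwise_discrimination_loss(score_same=0.5, score_other=0.5)
```

Under this assumed objective, no human preference label appears anywhere: the only supervision is which policy produced which trajectory.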

The empirical claim is strong: POLAR RMs at 1.8B to 7B parameters substantially outperform traditional, non-pre-trained reward modeling methods in downstream use. The relative framing makes the RM transferable in a way absolute-preference RMs are not.

The deeper move is conceptual. A reward model is not a value judgment; it is a similarity measure to a chosen reference. This connects to the note "Can models learn what makes research worth doing?": both treat reward as a relational construct (similarity to a reference, ranking within a community) rather than an absolute property. The dominant RLHF paradigm trains RMs to encode "what humans want"; POLAR trains them to encode "how close are you to this?" The latter scales because it admits any reference policy as target.

A concrete consequence: POLAR fits naturally into the verifier-free RL pattern emerging in late-2025 work. When the target policy is given (e.g., as a demonstration set), no manual preference labels are needed. RARO makes the same move via adversarial IRL. Both reject the labeled-preference bottleneck, but they differ in mechanism: RARO sets up an adversarial game against demonstrations, while POLAR scores similarity to a target policy, which makes its framing general-purpose.
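A rough sketch of how a demonstration can stand in for both verifier and preference labels in one RL step. Everything here is illustrative: `relative_reward` is a toy word-overlap scorer standing in for a learned RM, and the mean-centered advantage is a generic baseline choice, not something this note attributes to POLAR or RARO.

```python
from statistics import mean


def relative_reward(candidate: str, demonstration: str) -> float:
    """Toy scorer: fraction of demonstration words that appear in the candidate."""
    demo_words = demonstration.lower().split()
    cand_words = set(candidate.lower().split())
    if not demo_words:
        return 0.0
    return sum(w in cand_words for w in demo_words) / len(demo_words)


def advantages_from_demonstration(samples: list[str], demonstration: str) -> list[float]:
    """Score each sampled response against the demonstration, then subtract the group mean."""
    rewards = [relative_reward(s, demonstration) for s in samples]
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# The demonstration plays the role of the target policy's output.
demonstration = "sort the list then return the first element"
samples = [
    "sort the list then return the first element",
    "return the first element",
    "print hello",
]
advantages = advantages_from_demonstration(samples, demonstration)
```

Responses closer to the demonstration receive a positive advantage and would be reinforced; no labeled preference pairs or task-specific checker is involved.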

Inquiring lines that read this note 14

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do aggregate reward models systematically exclude minority user preferences?

What properties determine whether reward signals teach genuine reasoning?

Why do reward models learn surface-level shortcuts instead of genuine quality assessment?

Can language model RL training avoid reward hacking and misalignment?

Can alternative training methods improve on supervised fine-tuning for language models?

How can process reward models supervise complex reasoning traces?

What are the actual limits of sibling comparison versus trained process reward models?

Related concepts in this collection 3

Can models learn what makes research worth doing?
Can large language models be trained to recognize high-impact research directions by learning from citation patterns? This explores whether 'scientific taste', the judgment of what work matters, is a learnable skill separate from execution.
both reframe reward as relational: similarity-to-target (POLAR) vs ranking-within-community (RLCF); neither requires absolute preference labels

Can adversarial critics replace task-specific verifiers for reasoning?
Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
RARO uses adversarial discrimination against demonstrations; POLAR uses similarity to a target policy; same anti-labeled-preference move, different mechanism

Can generative reasoning beat discriminative models with less training data?
Do process reward models that generate reasoning before judging achieve better performance than traditional discriminative approaches when trained on dramatically smaller datasets? This tests whether generative verification can scale more efficiently.
generative PRMs add reasoning before judging; POLAR adds relative framing; these are orthogonal axes of RM improvement
