SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can reward models learn by comparing policies instead of judging them?

What if reward models worked as policy discriminators—measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

Traditional reward modeling presupposes an absolute preference. Humans rank responses, the RM learns "good" versus "bad" within that absolute frame, and the policy optimizes against the resulting signal. This reliance on manually defined preferences is what limits scale: every new task domain demands new preference data.

POLAR (2507.05197) redefines what an RM is. Instead of an absolute preference predictor, treat the RM as a policy discriminator: given a candidate policy and a target policy, quantify the difference. Higher scores go to policies more similar to the target. The reward signal guides the training policy toward desired behaviors without ever encoding what those behaviors should be in absolute terms.
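
One way to make the discriminator idea concrete is a pairwise objective: given a reference trajectory from the target policy, the RM should score a second trajectory from that same policy above a trajectory from a different policy. The sketch below uses a Bradley-Terry style loss for that comparison. The function names and the choice of loss are illustrative assumptions for this note, not a confirmed description of POLAR's training recipe.

```python
import math


def bradley_terry_loss(score_same: float, score_other: float) -> float:
    """Return -log(sigmoid(score_same - score_other)) in a numerically stable form."""
    margin = score_same - score_other
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def discrimination_loss(score_pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (same-policy score, other-policy score) pairs.

    Each score is assumed to come from an RM that has already seen the
    reference trajectory; that RM is hypothetical and not modeled here.
    """
    return sum(bradley_terry_loss(s, o) for s, o in score_pairs) / len(score_pairs)
```

The loss falls as the same-policy sample is scored further above the other-policy sample, so no notion of "good" or "bad" enters the objective, only "same policy as the reference" versus "different policy".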

The shift is consequential. Since target policies can be arbitrarily chosen, the objective becomes criterion-agnostic — it applies to any scenario where you can describe the desired policy by demonstration rather than by attribute. This eliminates the bottleneck of preference annotation and creates a scalable pre-training paradigm for RMs. Train once on policy discrimination; reuse across many task formulations by varying the target.
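
To illustrate "reuse by varying the target", the toy sketch below keeps one scorer fixed and swaps only the reference demonstration. The scorer is a deliberately crude stand-in (token-set overlap) for a learned discriminator; `toy_policy_score`, `rank_candidates`, and the example strings are all hypothetical.

```python
def toy_policy_score(reference: str, candidate: str) -> float:
    """Stand-in for a learned discriminator: Jaccard overlap of lowercase tokens."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens and not cand_tokens:
        return 1.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens | cand_tokens)


def rank_candidates(reference: str, candidates: list[str]) -> list[str]:
    """Order candidates from most to least similar to the reference."""
    return sorted(candidates, key=lambda c: toy_policy_score(reference, c), reverse=True)


candidates = [
    "the answer is 42",
    "well, let us think step by step: the answer is 42",
]

# Two different target behaviors, expressed only by demonstration.
terse_target = "the answer is 7"
verbose_target = "well, let us think step by step: the answer is 7"

best_for_terse = rank_candidates(terse_target, candidates)[0]
best_for_verbose = rank_candidates(verbose_target, candidates)[0]
```

With the terse demonstration as target, the short candidate ranks first; with the verbose demonstration, the step-by-step candidate does. The scorer never changes, only the reference it compares against.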

The empirical claim is strong: POLAR RMs at 1.8B to 7B parameters are reported to substantially outperform traditional, non-pre-trained reward models, and to carry that advantage into downstream use. The relative framing makes the RM transferable in a way absolute-preference RMs are not.

The deeper move is conceptual. A reward model is not a value judgment; it is a similarity measure to a chosen reference. This connects to "Can models learn what makes research worth doing?": both treat reward as a relational construct (similarity to a reference, ranking within a community) rather than an absolute property. The dominant RLHF paradigm trains RMs to encode "what humans want"; POLAR trains them to encode "how close are you to this target." The latter scales because it admits any reference policy as the target.

A concrete consequence: POLAR fits naturally into the verifier-free RL pattern emerging in late-2025 work. When the target policy is given (e.g., as a demonstration set), no manual preference labels are needed. This is the same move RARO makes via adversarial IRL (inverse reinforcement learning): both reject the labeled-preference bottleneck. The difference is that POLAR's relative framing is general-purpose, whereas RARO's is adversarial.
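
As a rough sketch of how a demonstration set can stand in for preference labels, the snippet below builds a reward function directly from prompt-to-demonstration pairs. The character-level similarity from Python's `difflib` is only a placeholder for a learned discriminator, and `make_reward_fn`, `stand_in_score`, and the example data are hypothetical names introduced here.

```python
from difflib import SequenceMatcher
from typing import Callable


def stand_in_score(reference: str, candidate: str) -> float:
    """Placeholder for a learned discriminator: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, reference, candidate).ratio()


def make_reward_fn(
    demonstrations: dict[str, str],
    score: Callable[[str, str], float] = stand_in_score,
) -> Callable[[str, str], float]:
    """Build a reward function from prompt -> demonstration pairs.

    No preference labels are involved: the demonstration itself is the target.
    """
    def reward(prompt: str, rollout: str) -> float:
        return score(demonstrations[prompt], rollout)

    return reward


# Hypothetical demonstration set and policy rollouts.
demos = {"What is 2 + 2?": "2 + 2 = 4"}
reward = make_reward_fn(demos)
rollouts = ["2 + 2 = 4", "I like bananas"]
rewards = [reward("What is 2 + 2?", r) for r in rollouts]
```

The resulting per-rollout rewards could then feed whatever policy-optimization step is in use; that step is outside the scope of this sketch.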

Original note title

reward models redefined as policy discriminators measure distance from a target policy — criterion-agnostic and scalable