SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can reward models benefit from reasoning before scoring?

Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.

Synthesis note · 2026-02-22 · sourced from Reward Models
How should we allocate compute budget at inference time?

Test-time compute scaling has been studied extensively for generation — but three independent research teams have simultaneously discovered it applies equally to evaluation. Reward Reasoning Models (RRMs), RM-R1, and DeepSeek-GRM all converge on the same insight: reward modeling is a reasoning task, and allowing the evaluator to "think" before scoring produces better rewards.

RRMs (2025) use RL to foster self-evolved reward reasoning without requiring explicit reasoning traces as training data. The model generates a chain-of-thought reasoning process before producing final rewards, adaptively allocating compute to queries where appropriate rewards are not immediately apparent. Multi-response strategies (ELO rating, knockout tournament) enable flexible test-time compute scaling. Crucially, RRMs develop reasoning patterns distinct from those of untrained foundation models: the training successfully reshapes how the model approaches evaluation.
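
The two multi-response strategies named above can be sketched as follows. This is a minimal illustration, not the RRM implementation: the pairwise `judge` callable, the `toy_judge` stand-in, and the Elo constants (`k=32`, base rating 1000, scale 400) are all assumptions introduced here. In practice `judge` would wrap a reasoning reward model that thinks before choosing between two responses.

```python
from itertools import combinations
from typing import Callable, Dict, List

# Hypothetical pairwise judge: returns 0 if the first response wins, 1 otherwise.
Judge = Callable[[str, str, str], int]


def knockout(query: str, responses: List[str], judge: Judge) -> str:
    """Single-elimination tournament: uses len(responses) - 1 judge calls."""
    pool = list(responses)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_round.append(a if judge(query, a, b) == 0 else b)
        if len(pool) % 2 == 1:
            next_round.append(pool[-1])  # odd one out gets a bye
        pool = next_round
    return pool[0]


def elo_ratings(query: str, responses: List[str], judge: Judge,
                k: float = 32.0, base: float = 1000.0) -> Dict[str, float]:
    """Round-robin Elo: one judge call per pair, so cost grows quadratically."""
    ratings = {r: base for r in responses}
    for a, b in combinations(responses, 2):
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))
        score_a = 1.0 if judge(query, a, b) == 0 else 0.0
        delta = k * (score_a - expected_a)
        ratings[a] += delta  # zero-sum update: what a gains, b loses
        ratings[b] -= delta
    return ratings


def toy_judge(query: str, a: str, b: str) -> int:
    """Stand-in for a reasoning reward model: prefers the longer response."""
    return 0 if len(a) >= len(b) else 1


candidates = ["a", "bbb", "cc", "dddd", "e"]
winner = knockout("q", candidates, toy_judge)
ratings = elo_ratings("q", candidates, toy_judge)
```

The trade-off is the compute knob: the knockout spends a linear number of comparisons to pick one winner, while the round-robin Elo pass spends a quadratic number to produce a full ranking.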

RM-R1 introduces Chain-of-Rubrics (CoR): the model first categorizes the input as "chat" or "reasoning," then follows a different evaluation strategy for each. Chat tasks get self-generated rubrics, justifications, and evaluations. Reasoning tasks get solve-first-then-evaluate. This task-type perception enables tailored reward generation. The training pipeline applies reasoning distillation first and RLVR second: distillation alone is insufficient, and RLVR alone fails to fully realize reasoning capabilities. Both stages are needed.
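
The task-type branching in CoR can be pictured as a prompt builder. This is a rough sketch under stated assumptions: in RM-R1 the model itself decides the task type as part of its output, whereas here `task_type` is passed in for simplicity, and the function names and prompt wording are hypothetical rather than taken from the paper.

```python
from typing import List


def cor_steps(task_type: str) -> List[str]:
    """Return the evaluation steps for a task type (wording is illustrative)."""
    if task_type == "chat":
        return [
            "Write rubrics suited to this query.",
            "Justify why each rubric matters.",
            "Evaluate both responses against the rubrics.",
        ]
    if task_type == "reasoning":
        return [
            "Solve the problem yourself first.",
            "Evaluate both responses against your own solution.",
        ]
    raise ValueError(f"unknown task type: {task_type!r}")


def build_judge_prompt(task_type: str, query: str,
                       response_a: str, response_b: str) -> str:
    """Assemble a judge prompt whose steps depend on the task type."""
    steps = "\n".join(
        f"{i}. {step}" for i, step in enumerate(cor_steps(task_type), start=1)
    )
    return (
        f"Task type: {task_type}\n"
        f"Query: {query}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Steps:\n{steps}\n"
        "Finish with the preferred response: A or B."
    )


chat_prompt = build_judge_prompt("chat", "Plan a picnic.", "draft A", "draft B")
math_prompt = build_judge_prompt("reasoning", "What is 17 * 23?", "391", "381")
```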

DeepSeek-GRM uses Self-Principled Critique Tuning (SPCT) via rule-based online RL to generate principles adaptively per query-response pair, then critique against those principles. Parallel sampling generates diverse principle-critique sets, enabling finer-grained reward resolution with larger compute budgets. A meta RM further guides the voting process for better scaling performance.
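
A minimal sketch of the voting step, assuming each sampled principle-critique set has already been reduced to one integer score per response. The function names, the score values, and the `keep` parameter are hypothetical; the source states only that parallel samples are voted on and that a meta RM guides the vote.

```python
from typing import List, Sequence


def vote(samples: Sequence[List[int]]) -> List[int]:
    """Sum the per-response scores across all sampled critiques."""
    return [sum(column) for column in zip(*samples)]


def meta_guided_vote(samples: Sequence[List[int]],
                     meta_scores: Sequence[float],
                     keep: int) -> List[int]:
    """Keep the `keep` samples the meta RM rates highest, then vote on those."""
    ranked = sorted(zip(meta_scores, samples), key=lambda pair: pair[0],
                    reverse=True)
    kept = [sample for _, sample in ranked[:keep]]
    return vote(kept)


# Four sampled critiques, each scoring two responses (made-up numbers).
samples = [[7, 5], [6, 8], [8, 4], [2, 9]]
# Hypothetical meta RM quality estimate for each sampled critique.
meta_scores = [0.9, 0.4, 0.8, 0.1]

plain = vote(samples)
guided = meta_guided_vote(samples, meta_scores, keep=2)
```

Summing over more samples widens the range of achievable totals, which is one way to read the claim of finer-grained reward resolution at larger compute budgets. In this toy data the plain vote favors the second response, while filtering by meta score flips the preference to the first.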

The convergence matters because it identifies a bottleneck that was hiding in plain sight: the evaluator's capability ceiling constrains the entire alignment pipeline. As the related note "Does the choice of RL algorithm actually matter for reasoning?" argues, the prior-bounded ceiling applies to reward models too, but reasoning-enabled reward models raise that ceiling by allocating compute adaptively.

Original note title

reward reasoning models extend test-time compute scaling to reward evaluation by producing reasoning traces before scoring