What are the actual limits of sibling comparison versus trained process reward models?
This explores the tradeoff between two ways to get step-by-step (process) feedback during reasoning training: deriving it for free from the structure of a search tree (comparing 'sibling' branches that share a starting point) versus training a separate reward model to judge each step — and where each approach runs out of road.
The cheap path is sibling comparison: in Tree-GRPO, you branch a reasoning trajectory at multiple points, then compare subtrees that diverged from the same node. Because the siblings share everything up to the branch point, the difference in their outcomes localizes credit to the step where they split. That turns a single trajectory-level success/failure signal into step-level preference data with no annotation and no separate model (see "Can tree structure alone convert outcome rewards into process supervision?"). The corpus shows this isn't a one-off trick: trajectory *structure* in general (tree topology, expert-aligned actions, tool-call positions) can be mined for dense step signals, and MCTS variants like AlphaLLM use search outcomes plus critics to manufacture process-quality signals that rival human labels (see "Can trajectory structure replace hand-annotated process rewards?" and "Can tree search replace human feedback in LLM training?").
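The sibling-comparison idea can be sketched in a few lines. This is a minimal illustration, not Tree-GRPO's actual implementation: the `Node` class, `subtree_value`, and `sibling_preferences` are hypothetical names, and scoring a subtree by the mean outcome of its leaves is an assumption made for the example.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                                   # reasoning step that leads to this node
    children: List["Node"] = field(default_factory=list)
    outcome: Optional[float] = None             # final outcome reward, set on leaves only


def leaf_outcomes(node: Node) -> List[float]:
    """Collect the outcome rewards of every leaf below `node`."""
    if not node.children:
        return [float(node.outcome)]            # assumes every leaf has an outcome
    collected: List[float] = []
    for child in node.children:
        collected.extend(leaf_outcomes(child))
    return collected


def subtree_value(node: Node) -> float:
    """Assumed scoring rule: mean outcome over the subtree's leaves."""
    outcomes = leaf_outcomes(node)
    return sum(outcomes) / len(outcomes)


def sibling_preferences(node: Node, prefix: Tuple[str, ...] = ()) -> list:
    """Return (shared_prefix, preferred_step, rejected_step, margin) tuples."""
    pairs = []
    for a, b in combinations(node.children, 2):
        value_a, value_b = subtree_value(a), subtree_value(b)
        if value_a == value_b:
            continue                            # equal outcomes carry no preference
        winner, loser = (a, b) if value_a > value_b else (b, a)
        pairs.append((prefix, winner.step, loser.step, abs(value_a - value_b)))
    for child in node.children:
        pairs.extend(sibling_preferences(child, prefix + (child.step,)))
    return pairs


# Toy tree: the root is the problem statement, so it contributes no step text.
root = Node("problem", children=[
    Node("factor the quadratic", children=[
        Node("roots are 2 and 3", outcome=1.0),
        Node("roots are 2 and -3", outcome=0.0),
    ]),
    Node("guess and check", children=[
        Node("try x = 1", outcome=0.0),
        Node("try x = 5", outcome=0.0),
    ]),
])
pairs = sibling_preferences(root)

# A tree in which every rollout fails produces no preference pairs at all.
all_fail = Node("problem", children=[
    Node("approach A", outcome=0.0),
    Node("approach B", outcome=0.0),
])
no_pairs = sibling_preferences(all_fail)
```

In the toy tree, the only information entering the computation is the leaf outcomes, and siblings whose subtrees score the same yield nothing. The step-level pairs are a redistribution of the outcome reward, not a new source of information.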
The limit of sibling comparison is hiding in plain sight: it can only tell you that one branch *led to* a better outcome, not *why* a step was good or bad. The comparison signal is still grounded in the final outcome reward; it just redistributes that outcome across steps. So when a model plateaus, sibling comparison redistributes the same impoverished information. That's exactly the gap Critique-GRPO names: numerical rewards (which is what outcome-derived step signals ultimately are) lack the information about *why* a failure happened, and natural-language critiques can break plateaus that additional numerical reward cannot (see "Can natural language feedback overcome numerical reward plateaus?").
This is where trained process reward models earn their cost, but the surprising finding is *which* trained PRMs are worth it. Discriminative PRMs that simply classify steps as good/bad are largely beaten by *generative* judges that reason about the reasoning before scoring. StepWiser, GenPRM, and ThinkPRM all show that a judge producing a chain-of-thought about each step is more accurate and dramatically more data-efficient: a 1.5B GenPRM beats GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using only 1% of the PRM800K labels (see "Can judges that reason about reasoning outperform classifier rewards?" and "Can generative reasoning beat discriminative models with less training data?"). The same 'reason first, score second' move lets reward models scale test-time compute and raises their capability ceiling beyond what outcome-based evaluation reaches (see "Can reward models benefit from reasoning before scoring?"). So the real axis isn't free-vs-trained. It's *outcome-grounded* signal (which both sibling comparison and discriminative PRMs ultimately are) versus *explanatory* signal that carries information the outcome never contained.
Two further limits reframe the whole comparison. First, there may be a ceiling on what *any* of these methods can produce: RLVR research suggests reward learning mostly activates reasoning strategies already latent in the pretrained model rather than teaching genuinely new skills, and spurious rewards work nearly as well as correct ones for well-pretrained models (see "What does reward learning actually do to model reasoning?"). If true, neither sibling comparison nor a lovingly trained PRM expands the frontier; they reallocate sampling efficiency within it, and the simpler method may be the rational choice. Second, trained reward models carry failure modes the structural approach sidesteps: binary correctness rewards quietly degrade calibration because they never penalize confident wrong answers (see "Does binary reward training hurt model calibration?"), and converting rubric scores into dense rewards invites reward hacking unless rubrics are used as gates rather than as the reward itself (see "Can rubrics and dense rewards work together without hacking?").
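The calibration failure, and the Brier-score remedy described in the source notes, can be made concrete with a small sketch. It assumes the model reports a confidence in [0, 1] alongside its answer. The function names are hypothetical, and the combined form (correctness minus squared confidence error) is one simple way to add a Brier term, not necessarily the exact formulation used in the cited work.

```python
def binary_reward(correct: bool) -> float:
    """Outcome-only reward: blind to the confidence the model reported."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the reported confidence.

    Assumed form: y - (confidence - y)^2, where y is 1.0 for a correct
    answer and 0.0 for an incorrect one.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# Compare both rewards on confident and hesitant answers.
for correct in (True, False):
    for confidence in (0.1, 0.9):
        print(
            f"correct={correct!s:5} confidence={confidence:.1f} "
            f"binary={binary_reward(correct):.2f} "
            f"brier_augmented={brier_augmented_reward(correct, confidence):.2f}"
        )
```

Under the binary reward, a wrong answer scores the same whether the model was 10% or 90% confident, so nothing discourages confident guessing. With the Brier term, the confident wrong answer is penalized far more heavily than the hesitant one, while a confident correct answer still scores highest.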
If you want the genuinely different lens, look at the approaches that dodge the dichotomy entirely: POLAR reframes reward modeling as measuring *distance from a target policy* rather than judging steps in the abstract (see "Can reward models learn by comparing policies instead of judging them?"), and Post-Completion Learning trains the model to internalize self-evaluation so the judge disappears at inference time (see "Can models learn to evaluate their own work during training?"). The honest summary: sibling comparison is bounded by the fact that its signal is still just the outcome reward wearing a step-level costume, while trained PRMs buy *explanatory* signal at the price of new failure modes. A live open question is whether either one moves the frontier or merely mines it more efficiently.
Sources (12 notes)
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
POLAR reframes reward modeling as policy discrimination: RMs assign higher scores to policies similar to a chosen target, eliminating absolute preference labels. Pre-trained 1.8B-7B parameter POLAR RMs substantially outperform non-pre-trained methods and transfer across task formulations.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.