Can self-supervised methods replace human annotations for process reward models?
This explores whether models can generate their own step-by-step reasoning rewards instead of relying on humans to hand-label each step — and the corpus says yes, through several different tricks.
Process reward models (PRMs) score each step of a model's reasoning, not just the final answer, and human step-labeling is the expensive bottleneck in training them. The collection has a surprisingly rich set of escape routes, most pointing the same direction: yes, the annotation oracle can be replaced.
The most direct evidence is MetaStone-S1's self-supervised PRM, which dynamically weights its own pseudo-labels and reaches o3-mini-level results with no human step annotation at all (Can self-supervised process rewards replace human annotation?). But the more interesting story is *how many different structural cues* can stand in for a human labeler. One family exploits the geometry of the reasoning itself: tree-search rollouts compare sibling branches to turn a single right/wrong outcome into per-step preferences (Can tree structure alone convert outcome rewards into process supervision?), and more broadly, trajectory structure, including tree topology, expert-aligned actions, and tool-call positions, gets mined for dense signal across several methods (Can trajectory structure replace hand-annotated process rewards?). AlphaLLM pushes this furthest, using Monte Carlo tree search plus critic models to manufacture feedback equivalent to human labels (Can tree search replace human feedback in LLM training?).
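The sibling-comparison idea can be sketched in a few lines. This is a minimal illustration, not Tree-GRPO's actual implementation: the `Node` class, the equal weighting of leaves, and the "value minus sibling mean" advantage are all assumptions chosen to show how a single right/wrong outcome at the leaves becomes a per-step signal.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Node:
    """One reasoning step in a rollout tree (hypothetical structure)."""
    label: str                        # unique name for the step
    outcome: Optional[float] = None   # set on leaves only: 1.0 correct, 0.0 wrong
    children: List["Node"] = field(default_factory=list)


def leaf_outcomes(node: Node) -> List[float]:
    """Collect the outcome of every leaf at or below this node."""
    if not node.children:
        return [float(node.outcome)]
    return [o for child in node.children for o in leaf_outcomes(child)]


def subtree_value(node: Node) -> float:
    """Assumed value estimate: mean outcome over the leaves under this step."""
    return mean(leaf_outcomes(node))


def sibling_advantages(root: Node) -> Dict[str, float]:
    """Score each step as its subtree value minus the mean over its siblings."""
    advantages: Dict[str, float] = {}
    stack = [root]
    while stack:
        parent = stack.pop()
        if not parent.children:
            continue
        values = [subtree_value(child) for child in parent.children]
        baseline = mean(values)
        for child, value in zip(parent.children, values):
            advantages[child.label] = value - baseline
        stack.extend(parent.children)
    return advantages


# Toy tree: step A always leads to a correct answer, step B only half the time.
tree = Node("root", children=[
    Node("A", children=[Node("A1", outcome=1.0), Node("A2", outcome=1.0)]),
    Node("B", children=[Node("B1", outcome=0.0), Node("B2", outcome=1.0)]),
])
advantages = sibling_advantages(tree)
print(advantages["A"], advantages["B"])  # 0.25 -0.25
```

No step was ever labeled here. Step A is preferred over step B only because its descendants succeed more often, which is the sense in which the branching structure stands in for the annotator.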
A second family skips structure and reads signal straight out of the model's own internal state. Information-theoretic rewards use PAC-Bayes bounds and Fisher information to measure how much each step contributes to getting the answer right (Can we reward reasoning steps without human annotation?). ΔBelief-RL goes even leaner: it tracks how much each turn shifts the model's own probability toward the solution, assigning credit with no critic network and no PRM whatsoever (Can an agent's own beliefs guide credit assignment without critics?). And reverse-curriculum RL gets process-level granularity from pure outcome feedback by sliding the start of the reasoning backward step by step until failure modes reveal themselves (Can curriculum learning approximate expensive process supervision?).
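The belief-shift idea reduces to simple arithmetic once the per-turn probabilities are in hand. The sketch below assumes the credit for a turn is the log-ratio of the model's probability on the solution after that turn versus before it, as the source note describes. The function name, the clipping constant `eps`, and the example numbers are hypothetical, and the published method may differ in its details.

```python
import math
from typing import List


def belief_shift_credits(p_solution: List[float], eps: float = 1e-12) -> List[float]:
    """Per-turn credit as log p_t - log p_{t-1}.

    p_solution[0] is the model's probability on the solution before any turn;
    p_solution[t] is the probability after turn t. Values are clipped at eps
    (an assumed safeguard) to avoid log(0).
    """
    logs = [math.log(max(p, eps)) for p in p_solution]
    return [after - before for before, after in zip(logs, logs[1:])]


# Hypothetical beliefs over a three-turn episode: turn 2 was a step backward.
beliefs = [0.05, 0.10, 0.08, 0.64]
credits = belief_shift_credits(beliefs)

for turn, credit in enumerate(credits, start=1):
    print(f"turn {turn}: {credit:+.3f}")
```

The credits telescope: their sum equals the log-ratio of the final belief to the initial one, so the per-turn signal is a decomposition of the overall progress and needs no separate critic to estimate it.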
The quietly radical idea is to dissolve the separate reward model entirely. Post-completion learning trains the model to evaluate its own work in the unused sequence space after its answer, internalizing the reward function at zero inference cost (Can models learn to evaluate their own work during training?), while self-play loops manufacture the missing feedback through a challenger-and-judge setup that needs no human in the loop (Can language models learn skills without human supervision?). The thread connecting all of these: the supervision was always latent in the rollouts, the model's beliefs, or the task structure; humans were just one way to read it out.
What to keep in mind before declaring victory: these methods are validated mostly on verifiable domains (math, tool use, games) where correctness is crisp, and MetaStone-S1's own caveat is that generalization to fuzzy-outcome domains stays unproven. The corpus also hints that the reward signal itself needs care: making reward models *reason* before scoring raises their ceiling (Can reward models benefit from reasoning before scoring?), decomposing fuzzy instructions into verifiable checklists rescues subjective tasks (Can breaking down instructions into checklists improve AI reward signals?), and naive binary rewards quietly wreck calibration unless you patch them (Does binary reward training hurt model calibration?). So the honest answer is: self-supervision can replace human annotation wherever the outcome is checkable, which is most of where PRMs are used today, but the frontier is the fuzzy domains where there's no structural shortcut to mine.
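The calibration patch mentioned above can be made concrete. The calibration note in the sources describes adding a Brier score term to the binary correctness reward. The sketch below assumes the two terms are combined with equal weight as `correctness - (confidence - correctness)^2`. The function names and the grid check are illustrative, not taken from the cited work.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty (equal weights assumed)."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier = (confidence - y) ** 2
    return y - brier


def expected_reward(p_correct: float, reported: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, reported)
            + (1.0 - p_correct) * calibrated_reward(False, reported))


# A plain binary reward gives 0 to both wrong answers below; this one
# separates the confident miss from the cautious miss.
confident_miss = calibrated_reward(False, 0.9)
cautious_miss = calibrated_reward(False, 0.1)

# Grid check: with a true success rate of 0.7, which reported confidence
# maximizes the expected reward?
grid = [i / 100 for i in range(101)]
best_report = max(grid, key=lambda q: expected_reward(0.7, q))
print(best_report)  # 0.7
```

Because the Brier score is a proper scoring rule, the expected reward peaks when the reported confidence equals the true probability of being right, so overclaiming confidence no longer pays.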
Sources (12 notes)
MetaStone-S1's SPRM achieves o3-mini-level results using dynamic weighting of pseudo-labels instead of human-annotated steps. This eliminates the annotation bottleneck for process supervision, though generalization to fuzzy-outcome domains remains unproven.
Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
L2T uses PAC-Bayes bounds and Fisher information to compute per-episode rewards measuring each step's contribution to correctness. This annotation-free approach matches dense feedback quality while avoiding the 2x excess tokens that outcome-only methods produce.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.