Why do harness validators shape what models learn to emit?
This explores why the validator sitting in the training loop — the reward function, the answer-checker, the trajectory filter — quietly dictates what a model says, rather than just measuring it. The short version: a model optimizes toward whatever the validator rewards, so every blind spot in the validator becomes a learned behavior. The validator defines the gradient, and the gradient defines the model.
You can watch this happen at the level of raw format. RL post-training reliably collapses a model onto a single dominant output format inherited from pretraining, amplifying one style within the first epoch while suppressing the alternatives, and which one wins depends on model scale, not necessarily on which performs better (Does RL training collapse format diversity in pretrained models?). The validator isn't asking for that homogenization; it's a side effect of optimizing hard against a narrow signal. The same mechanism turns ugly when the signal is crude. Binary correctness rewards never penalize a confident wrong answer, so they push models toward high-confidence guessing and wreck calibration, until you add a proper scoring term such as the Brier score, which makes the validator care about being right *and* knowing it (Does binary reward training hurt model calibration?). And when the validator hands out problems that are too hard, group-relative normalization treats rare accidental successes as high-advantage trajectories, so the model learns answer-repetition and computation-skipping shortcuts that then contaminate capabilities it already had (Do overly hard RLVR samples actually harm model capabilities?). In each case the model is faithfully emitting what the validator scored; the validator just scored the wrong thing.
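Two of those failure modes are easy to check with arithmetic. The sketch below is illustrative only: `expected_reward` and `group_advantages` are hypothetical helper names, the Brier-style penalty is written as correctness minus squared confidence error, and the advantage is the usual (reward minus group mean) divided by group standard deviation. None of this is taken from the cited notes' code.

```python
import statistics


def expected_reward(p_correct: float, stated_conf: float, use_brier: bool) -> float:
    """Expected reward for an answer that is right with probability p_correct
    while the model reports confidence stated_conf."""
    def reward(y: int) -> float:
        r = float(y)                      # binary correctness term
        if use_brier:
            r -= (stated_conf - y) ** 2   # Brier-style penalty on stated confidence
        return r
    return p_correct * reward(1) + (1 - p_correct) * reward(0)


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantage over one prompt's rollouts: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]     # all rollouts tied: no learning signal
    return [(r - mean) / std for r in rewards]


# A model that is right 60% of the time, reporting 60% vs. 100% confidence.
binary_honest = expected_reward(0.6, 0.6, use_brier=False)
binary_bluff = expected_reward(0.6, 1.0, use_brier=False)
brier_honest = expected_reward(0.6, 0.6, use_brier=True)
brier_bluff = expected_reward(0.6, 1.0, use_brier=True)

# One lucky success in eight rollouts vs. four successes in eight.
rare = group_advantages([1, 0, 0, 0, 0, 0, 0, 0])
common = group_advantages([1, 1, 1, 1, 0, 0, 0, 0])
```

Under the binary reward the honest and the overconfident report earn the same expected reward, so nothing discourages bluffing; with the penalty term the honest report scores higher. In the second half, the single accidental success on a too-hard prompt receives a much larger advantage than any one success on a prompt the model solves half the time, which is the over-weighting described above.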
There's a deeper reason the shaping bites so hard: post-training flips a model from passively predicting text to treating its own outputs as actions that shape its future inputs, closing an action-perception loop with 3-4x lower on-policy output entropy (Do models recognize their own outputs as actions shaping future inputs?). Once a model is operating in that mode, the validator isn't grading essays after the fact; it's the environment the model is acting *into*. Whatever the validator selects for becomes the model's working model of 'what works.' That's also why filtering choices matter so precisely: keeping only clean positive trajectories while preserving messy failures as negative signal lets a 14B model reach frontier math performance, because the filter teaches it which errors to avoid instead of which to tolerate (Why do correct code trajectories teach models to tolerate errors?).
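To make the filtering idea concrete, here is a minimal sketch of an asymmetric trajectory filter. It is not the published GRPO-RoC procedure: the `Trajectory` fields, the `tool_errors` quality signal, and the half-and-half split between positives and negatives are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str
    correct: bool      # verdict from the answer checker
    tool_errors: int   # hypothetical quality signal: failed tool calls in the trace


def filter_group(rollouts: list[Trajectory], keep: int, rng: random.Random) -> list[Trajectory]:
    """Return up to `keep` trajectories: the cleanest correct ones,
    plus failures sampled at random so their variety is preserved."""
    positives = sorted((t for t in rollouts if t.correct), key=lambda t: t.tool_errors)
    negatives = [t for t in rollouts if not t.correct]
    n_pos = min(len(positives), keep // 2)       # assumed split, not from the source
    n_neg = min(len(negatives), keep - n_pos)
    return positives[:n_pos] + rng.sample(negatives, n_neg)


rollouts = [
    Trajectory("p0", True, 3), Trajectory("p1", True, 0),
    Trajectory("p2", True, 1), Trajectory("p3", True, 2),
    Trajectory("n0", False, 4), Trajectory("n1", False, 0),
    Trajectory("n2", False, 2), Trajectory("n3", False, 1),
]
kept = filter_group(rollouts, keep=4, rng=random.Random(0))
```

The asymmetry is the point: correct trajectories are ranked by cleanliness before they are allowed to count as positive examples, while failures are kept without ranking, so a right answer reached through a sloppy trace is never reinforced.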
The most interesting twist is that the validator doesn't have to be external at all, and recent work is steadily dissolving it into the model itself. A model's own token probabilities can replace a separate verifier as the reward signal (Can model confidence alone replace external answer verification?); answer-span confidence can rank reasoning traces and even reverse the calibration damage that RLHF causes (Can model confidence work as a reward signal for reasoning?); and the whole verifier-free turn decomposes into a few substitutable patterns where pairwise self-judgment, internal belief-shift, and self-distillation stand in for the reward model, the critic, and the explicit reward signal respectively (Can language models replace reward models with internal signals?). But internalizing the validator inherits its pathologies too: models carry a structural bias toward trusting answers they generated themselves, because a high-probability answer simply *feels* correct during self-evaluation (Why do models trust their own generated answers?). When the judge and the judged are the same network, the loop can quietly grade itself into a corner.
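A rough sketch of confidence-as-validator, assuming you already have per-token log-probabilities for the answer span of each sampled trace. The function names, the geometric-mean aggregation, and the `min_gap` threshold are illustrative assumptions, not the RLPR, INTUITOR, or RLSF implementations.

```python
import math
from itertools import combinations


def span_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean probability of the answer tokens (exp of the mean log-prob)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def preference_pairs(traces: list[tuple[str, list[float]]], min_gap: float = 0.1):
    """Rank traces by answer-span confidence and emit (chosen, rejected) pairs
    whenever the confidence gap is at least `min_gap`."""
    scored = sorted(((span_confidence(lp), text) for text, lp in traces), reverse=True)
    return [
        (hi_text, lo_text)
        for (hi, hi_text), (lo, lo_text) in combinations(scored, 2)
        if hi - lo >= min_gap
    ]


traces = [
    ("A", [math.log(0.9), math.log(0.9)]),
    ("B", [math.log(0.5), math.log(0.5)]),
    ("C", [math.log(0.85)]),
]
pairs = preference_pairs(traces)
```

Notice what is missing: nothing in this loop ever checks whether an answer is right. The ranking rewards whatever the model already finds probable, which is exactly how the self-trust bias described above can leak into training unless answers are also compared against a broader set of alternatives.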
The thing you didn't know you wanted to know: making the validator *reason out loud before judging* changes what it can teach. Generative process reward models that produce a chain of thought before scoring beat discriminative ones using far less labeled data: a 1.5B generative judge outscores GPT-4o, and another surpasses full-dataset discriminative verifiers using only 1% of the labels (Can generative reasoning beat discriminative models with less training data?). So the validator's *form*, not just its accuracy, is itself a lever on what the model learns to emit. A richer judge writes a richer curriculum.
Sources (10 notes)
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.