INQUIRING LINE

How do reward models and self-improvement mechanisms interact in training?

This explores the tension between models scoring their own work (self-improvement) and the external reward signals that training usually relies on — and the corpus's clear verdict is that the two need each other.


The corpus's strongest claim is that pure self-improvement quietly borrows external signals to work at all. The cleanest statement of the limit is the generation-verification gap: a model can only bootstrap itself when it judges answers better than it produces them, and that gap grows with model size but vanishes entirely for factual tasks What limits how much models can improve themselves?. Push past that limit and self-improvement stalls: diversity collapses and the model learns to game its own scoring. What makes the 'self-improving' methods that *do* work actually work is that they smuggle in an outside anchor: a past checkpoint, a third-party judge, a tool result, or a human correction Can models reliably improve themselves without external feedback?.

So the interesting design question isn't 'reward model or self-improvement'; it's *where the external signal hides*. One direction internalizes the reward model into the policy itself. Post-completion learning trains a model to evaluate its own output in the unused sequence space after its answer, so it learns to compute its own reward without adding any inference cost Can models learn to evaluate their own work during training?. A broader survey of late-2025 work finds this is a convergent trend: verifier-free RL keeps decomposing into three substitutable tricks (pairwise self-judgment replacing the reward model, internal belief-shift replacing the critic, and rich-feedback self-distillation replacing the reward signal), each emerging from the policy's own computation Can language models replace reward models with internal signals?.

The other direction asks what the reward model actually *teaches*. Several notes suggest scalar rewards are informationally thin. Feedback naturally splits into evaluative ('how good was that') and directive ('how should it change') channels, and a single number throws the directive half away Can scalar rewards capture all the information in agent feedback?. That lost information is exactly what lets natural-language critiques break through plateaus that more numerical reward can't move Can natural language feedback overcome numerical reward plateaus?. And the reward's *shape* matters as much as its richness: binary correctness rewards push models toward confident guessing and wreck calibration, fixable by adding a proper scoring rule as a second term Does binary reward training hurt model calibration?; surprisingly, training on *only* the negative signal — suppressing wrong trajectories — can match full RL while preserving the diversity that self-improvement needs Does negative reinforcement alone outperform full reinforcement learning?.
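A minimal sketch of the calibration fix described above, assuming the model reports a confidence in [0, 1] alongside its answer. The note names the Brier score as the second term; the exact combination used here (correctness minus squared error) and all function names are illustrative assumptions, not the source method's implementation.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness reward minus a Brier penalty on the stated confidence.

    Assumption: the two terms are combined by simple subtraction.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * calibrated_reward(True, confidence)
            + (1.0 - p_correct) * calibrated_reward(False, confidence))


def best_confidence(p_correct: float, steps: int = 100) -> float:
    """Grid-search the confidence report that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(p_correct, q))
```

Under a binary reward alone, the stated confidence earns nothing, so confident guessing is never penalized. With the Brier term, a confidently wrong answer scores -1.0, and the expected reward peaks when the reported confidence equals the true chance of being right, which is the property that makes the scoring rule "proper".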

There's also a sobering note on what reward learning even does. RLVR appears to activate reasoning strategies already latent in pretraining rather than teach new ones — a single example can trigger it, and spurious rewards work nearly as well as correct ones for a well-pretrained model What does reward learning actually do to model reasoning?. That reframes the whole interaction: if the reward is mostly *unlocking* existing capability, self-improvement and reward models are two ways of pointing a model at skills it already has, and the question becomes which one points more cheaply and with less hacking.

The corpus's most practical advice is about keeping that interaction honest. Reward hacking is the failure mode that haunts self-scoring, and the fix is often architectural separation: use rubrics as *gates* that accept or reject whole rollouts rather than as dense rewards to optimize against, which preserves their strength without inviting exploitation Can rubrics and dense rewards work together without hacking?. Tree search offers a different escape — MCTS plus critic models manufactures dense, process-level reward signals without human annotation, letting the model's own search supply the external-feeling anchor Can tree search replace human feedback in LLM training?. And how you process the resulting trajectories isn't neutral: treating successes as concrete demonstrations and failures as abstracted lessons beats processing both uniformly Should successful and failed episodes be processed differently?. The thread running through all of it: a model improving itself is really a model arranging for a trustworthy external signal — and most of the engineering is in making that signal hard to fake.
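The rubric-as-gate idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `Rollout` type, the `dense_reward` field, and the rubric checks are all hypothetical, and the gate is applied to individual rollouts for brevity, whereas the source note describes accepting or rejecting rollout groups.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Rollout:
    text: str
    dense_reward: float  # hypothetical scalar from some separate scorer


# A rubric check is a hypothetical pass/fail predicate over the rollout text.
RubricCheck = Callable[[str], bool]


def passes_gate(rollout: Rollout, checks: Sequence[RubricCheck]) -> bool:
    """Accept a rollout only if every rubric check passes."""
    return all(check(rollout.text) for check in checks)


def gated_advantages(
    rollouts: Sequence[Rollout], checks: Sequence[RubricCheck]
) -> List[Tuple[Rollout, float]]:
    """Drop rejected rollouts, then center dense rewards over the survivors.

    The rubric never contributes to the reward value itself, so there is
    no rubric score for the policy to climb.
    """
    accepted = [r for r in rollouts if passes_gate(r, checks)]
    if not accepted:
        return []
    baseline = sum(r.dense_reward for r in accepted) / len(accepted)
    return [(r, r.dense_reward - baseline) for r in accepted]
```

The design choice is that a rollout failing the rubric is removed from the update entirely, however high its dense reward, so an inflated score on an invalid answer cannot pull the policy toward it.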


Sources 12 notes

What limits how much models can improve themselves?

Models can only improve themselves when they verify solutions better than they generate them. This gap scales with model size but vanishes entirely for factual tasks, predicting which domains benefit from self-improvement.

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Can models learn to evaluate their own work during training?

Post-Completion Learning uses the otherwise unused sequence space after the model's output to train self-assessment, without adding any inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the full range of k, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Next inquiring lines