SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Why does self-rewarding training collapse when responses improve?

Self-Rewarding LLMs merge generator and evaluator for efficient iteration, but both improve so fast that good and bad responses converge, erasing the learning signal. What causes this failure and how can it be fixed?

Synthesis note · 2026-02-22 · sourced from Reward Models

Self-Rewarding Language Models (Yuan et al., 2024) merge the generator and the evaluator into a single model. The model generates candidate responses, evaluates them via LLM-as-a-Judge prompting, selects preference pairs, and trains via DPO. Each iteration improves both instruction following and reward quality. This co-evolution sidesteps the frozen-reward-model bottleneck — the evaluator grows alongside the generator.
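The pair-construction step of one iteration can be sketched as follows. This is a minimal illustration, not the paper's implementation: `build_preference_pair`, `toy_generate`, and `toy_judge` are hypothetical names, and the length-based judge merely stands in for LLM-as-a-Judge scoring by the same model. The resulting (chosen, rejected) pair is what would be passed to DPO training.

```python
import itertools
from typing import Callable, Tuple


def build_preference_pair(
    prompt: str,
    generate: Callable[[str], str],
    judge: Callable[[str, str], float],
    num_candidates: int = 4,
) -> Tuple[str, str]:
    """Sample candidates, score each with the judge, return (chosen, rejected)."""
    candidates = [generate(prompt) for _ in range(num_candidates)]
    # Sort ascending by judge score: lowest-scored first, highest-scored last.
    ranked = sorted(candidates, key=lambda response: judge(prompt, response))
    return ranked[-1], ranked[0]


# Toy stand-ins (assumptions): in the real loop, one LLM plays both roles.
_canned = itertools.cycle(["ok", "a longer answer", "the most detailed answer"])


def toy_generate(prompt: str) -> str:
    return next(_canned)


def toy_judge(prompt: str, response: str) -> float:
    # Hypothetical scoring rule: longer responses score higher.
    return float(len(response))


chosen, rejected = build_preference_pair(
    "Explain DPO.", toy_generate, toy_judge, num_candidates=3
)
print("chosen:", chosen)
print("rejected:", rejected)
```

Because generation and judging share one set of weights in the real method, each DPO update changes both the candidates and the scores used to rank them in the next iteration.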

But the approach hits a wall. The Temporal Self-Rewarding paper (2025) identifies the mechanism: as both chosen and rejected responses improve across iterations, their representations converge. The quality gap between "best" and "worst" responses shrinks, with the score gap narrowing by 9x. The DPO gradient is proportional to the difference between the gradients of the chosen and rejected log-probabilities, so when the two responses become representationally similar, that difference, and with it the gradient, vanishes. The model can no longer learn because it can no longer distinguish good from bad.
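A toy calculation makes the vanishing gradient concrete. The sketch below assumes a deliberately simplified linear policy where log π(y|x) = θ·φ(y) for a feature vector φ(y), and a reference policy that contributes no margin; both are illustrative assumptions, not the setup of either paper. Under those assumptions the DPO loss is −log σ(β·θ·(φ_chosen − φ_rejected)), and its gradient with respect to θ has norm β·σ(−margin)·‖φ_chosen − φ_rejected‖.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_grad_norm(
    theta: List[float],
    phi_chosen: List[float],
    phi_rejected: List[float],
    beta: float = 0.1,
) -> float:
    """Gradient norm of the DPO loss w.r.t. theta for the toy linear policy."""
    diff = [c - r for c, r in zip(phi_chosen, phi_rejected)]
    # Implicit reward margin between chosen and rejected (reference margin assumed 0).
    margin = beta * sum(t * d for t, d in zip(theta, diff))
    # Per-example weight from the DPO gradient: beta * sigmoid(-margin).
    weight = beta * sigmoid(-margin)
    return weight * math.sqrt(sum(d * d for d in diff))


theta = [0.5, -0.2]          # hypothetical policy parameters
phi_rejected = [1.0, 1.0]    # hypothetical features of the rejected response

# Shrink the representational gap between chosen and rejected.
for gap in (1.0, 0.1, 0.0):
    phi_chosen = [phi_rejected[0] + gap, phi_rejected[1]]
    print(f"gap={gap}: grad norm={dpo_grad_norm(theta, phi_chosen, phi_rejected):.6f}")
```

The sigmoid weight stays near 0.5 throughout, so the shrinkage comes almost entirely from the feature difference: the gradient norm falls in step with the gap and is exactly zero when the two representations coincide.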

This is a different failure mode from "Does policy entropy collapse limit reasoning performance in RL?" (which is about narrowing action diversity) and from "How quickly do errors compound during model self-training?" (which is about accumulating errors). Here, the model is actually improving, but the improvement itself destroys the preference learning signal.

The fix is temporal decoupling: (1) Anchored Rejection — fix rejected responses using outputs from the initial SFT model (past generation), preventing quality inflation in negative samples. (2) Future-Guided Chosen — select positive samples using a temporarily trained next-generation model, accessing superior responses unavailable to the current model. By decoupling chosen and rejected responses across temporal versions, the representational gap is maintained without additional compute (the method uses half the training iterations of standard Self-Rewarding).
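The two steps above can be sketched as a pair-construction routine. This is a rough illustration under stated assumptions: `temporal_pair`, the generator callables, and the length-based judge are hypothetical, and the selection rule used here (lowest-scored past output as rejected, highest-scored future output as chosen) is a simplification that may differ from the paper's exact procedure.

```python
from typing import Callable, List, Tuple

Generate = Callable[[str], List[str]]
Judge = Callable[[str, str], float]


def temporal_pair(
    prompt: str,
    past_generate: Generate,
    future_generate: Generate,
    judge: Judge,
) -> Tuple[str, str]:
    """Build (chosen, rejected) from two different temporal model versions."""
    past_candidates = past_generate(prompt)      # e.g. the initial SFT model
    future_candidates = future_generate(prompt)  # e.g. a temporarily trained next model
    # Anchored Rejection: the negative sample comes from the past model.
    rejected = min(past_candidates, key=lambda r: judge(prompt, r))
    # Future-Guided Chosen: the positive sample comes from the future model.
    chosen = max(future_candidates, key=lambda r: judge(prompt, r))
    return chosen, rejected


# Toy stand-ins (assumptions) for the two model versions and the judge.
def toy_past(prompt: str) -> List[str]:
    return ["bad", "meh answer"]


def toy_future(prompt: str) -> List[str]:
    return ["good detailed answer", "fine"]


def toy_judge(prompt: str, response: str) -> float:
    return float(len(response))


chosen, rejected = temporal_pair("Explain DPO.", toy_past, toy_future, toy_judge)
print("chosen:", chosen)
print("rejected:", rejected)
```

Drawing the two sides of the pair from different model versions keeps them apart even as the current model improves, which is the property the note attributes to temporal decoupling.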

The broader implication: any iterative self-improvement loop where the same model evaluates and generates will eventually converge unless the evaluation signal is anchored to an external reference point — whether that's a past model, a future model, or an external critic.

Complementary fix: Meta-Rewarding. The note "Why do self-improvement loops eventually stop improving?" addresses the same saturation problem from a different angle. While temporal anchoring fixes the preference signal (maintaining the chosen-rejected gap), Meta-Rewarding fixes the evaluator quality by adding a meta-judge that evaluates the judge's judgments. The two solutions are complementary: temporal anchoring prevents gradient collapse; meta-judging prevents evaluation stagnation. A system could use both: meta-judging to improve judge accuracy, temporal anchoring to maintain preference signal strength.

Original note title

self-rewarding iterative training creates a co-evolution loop but suffers gradient collapse when chosen-rejected responses converge — temporal anchoring to past and future models maintains the learning signal