SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Why do self-improvement loops eventually stop improving?

Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?

Synthesis note · 2026-02-22 · sourced from Self Refinement Self Consistency Feedback

Meta-Rewarding (demonstrated on Llama-3-8B-Instruct) shows that self-improvement loops stall not because the actor can't improve, but because the judge that evaluates improvement doesn't keep up. Prior self-rewarding work unified generator and evaluator in a single model, improving the actor through iterative DPO on self-generated preference pairs. But judging capability stayed static: the same evaluation quality was applied to increasingly sophisticated outputs. The result is saturation or, worse, reward hacking against a fixed evaluation surface.
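For readers who want the DPO objective made concrete, here is a minimal sketch of the standard per-pair DPO loss. The function name, argument names, and the default `beta` are illustrative assumptions; the note does not describe the training code used in Meta-Rewarding.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model.
    beta=0.1 is an illustrative default, not a value taken from the note.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = -beta * margin
    # Numerically stable softplus: log(1 + exp(x)) == -log(sigmoid(-x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

The loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one, measured against the reference model. A frozen judge only affects which response ends up labeled "chosen", which is exactly where a static evaluator caps the loop.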

The fix is a third role: the meta-judge. Using LLM-as-a-Meta-Judge prompting, the model evaluates its own judgments by selecting the better of two judgments of the same response. This creates preference data for the judge, not just for the actor. Training on both actor and judge preferences via DPO co-evolves the two capabilities.
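The two kinds of preference data described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: `PreferencePair`, `actor_pair`, `judge_pairs`, and the `meta_judge` callable are hypothetical names, and pairing the highest- and lowest-scored responses is one simple way to build actor pairs.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def actor_pair(prompt, responses, scores):
    """Build one actor pair: highest-scored response vs lowest-scored one."""
    ranked = sorted(zip(scores, responses), key=lambda item: item[0])
    (low_score, worst), (high_score, best) = ranked[0], ranked[-1]
    if high_score == low_score:
        return None  # all scores tie, so there is no preference signal
    return PreferencePair(prompt, chosen=best, rejected=worst)


def judge_pairs(judge_prompt, judgments, meta_judge):
    """Build judge pairs by comparing judgments of the same response.

    meta_judge is a hypothetical callable (judge_prompt, a, b) -> a or b,
    standing in for an LLM-as-a-Meta-Judge call.
    """
    pairs = []
    for a, b in combinations(judgments, 2):
        if a == b:
            continue  # identical judgments carry no preference
        winner = meta_judge(judge_prompt, a, b)
        loser = b if winner == a else a
        pairs.append(PreferencePair(judge_prompt, chosen=winner, rejected=loser))
    return pairs
```

Both functions emit the same pair type, so a single DPO pass can consume actor pairs and judge pairs together, which is what lets the two roles improve in step.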

The results are surprisingly strong for an unsupervised method: AlpacaEval 2 win rate rises from 22.9% to 39.4% (+16.5 points) and Arena-Hard from 20.6% to 29.1% (+8.5 points). The meta-judging step focuses on responses where the judge is least certain (highest score variance), targeting calibration at the decision boundary.
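The variance-based selection can be sketched in a few lines. This assumes each response has been scored several times by the judge; the function name `select_uncertain` and the use of population variance are illustrative choices, not details confirmed by the note.

```python
from statistics import pvariance


def select_uncertain(judge_scores, k):
    """Return ids of the k responses whose repeated judge scores vary most.

    judge_scores maps a response id to the list of scores the judge gave it
    across several independent judging passes.
    """
    variances = {
        rid: pvariance(scores)
        for rid, scores in judge_scores.items()
        if len(scores) >= 2  # variance needs at least two judgments to be useful
    }
    return sorted(variances, key=variances.get, reverse=True)[:k]
```

For example, with scores `{"r1": [4, 4, 4], "r2": [1, 5, 3], "r3": [3, 4, 3]}`, the judge agrees with itself on `r1` and disagrees most on `r2`, so `r2` is the first candidate sent to the meta-judge.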

A practical complication is length explosion. With each iteration, responses grow longer because the judge has a length bias, a well-known reward model problem. Meta-Rewarding requires explicit length control to prevent this.
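One simple form of length control is to treat near-top scores as ties and prefer the shortest response among them. The sketch below illustrates that heuristic only; the function name and the `tolerance` parameter are assumptions, and the note does not say which mechanism Meta-Rewarding actually uses.

```python
def pick_chosen_with_length_control(responses, scores, tolerance=0.2):
    """Pick the shortest response whose score is within `tolerance` of the best.

    tolerance is a hypothetical knob: 0 disables length control, and larger
    values trade a little judged quality for shorter chosen responses.
    """
    top = max(scores)
    near_top = [r for r, s in zip(responses, scores) if s >= top - tolerance]
    return min(near_top, key=len)
```

Because the chosen side of each preference pair is no longer the longest high-scoring answer by default, the judge's length bias has less room to compound across iterations.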

This is a different solution to the same problem addressed in the note "Why does self-rewarding training collapse when responses improve?". Temporal anchoring fixes the preference signal (maintaining the gap between chosen and rejected). Meta-judging fixes the evaluator quality (making the judge more accurate). The two fixes are complementary: a system could use both.

The broader principle: any self-improvement loop where the evaluator doesn't improve alongside the learner will eventually stall. This applies to RLHF (frozen reward models), self-rewarding (same-model judging), and even human-in-the-loop systems where human evaluators don't recalibrate as models improve.

Original note title

self-improvement requires co-evolving the evaluator alongside the actor — a static judge becomes the ceiling that constrains actor training