SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can online LLM feedback improve direct preference optimization during training?

Direct alignment methods like DPO train on fixed preference data generated by older models, which makes training off-policy. Could sampling fresh responses from the current model and using an LLM judge to pick preferences in real time reduce overfitting and improve alignment?

Synthesis note · 2026-06-03 · sourced from Reinforcement Learning

Direct alignment from preferences (DAP) methods such as DPO, IPO, and SLiC are attractive because they skip the separate reward model and update the policy directly from pairwise preferences. But their preference datasets are collected before training and never updated, and the responses usually come from a different model, so as the policy evolves the training data drifts off-policy and the method becomes prone to overfitting. Online AI feedback (OAIF) offers a simple fix: on each training iteration, sample two responses from the current model and prompt an LLM annotator to pick the preferred one, which supplies the preference label online. Despite its simplicity, human evaluation shows OAIF beats both offline DAP and RLHF, and it mitigates reward over-optimization, the overfitting that plagues offline DAP.
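The loop described above can be sketched on a toy problem. This is a minimal illustration, not the method's actual implementation: the "policy" is a softmax over a handful of fixed candidate responses, `toy_judge` is a hypothetical stand-in for the LLM annotator (it simply prefers the response with the higher hidden quality score), and the names `online_dpo`, `quality`, `beta`, and `lr` are all assumptions made for the example. The update is a hand-derived gradient step on the DPO loss.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def toy_judge(a, b, quality):
    """Hypothetical stand-in for the LLM annotator.

    Returns (winner, loser) by comparing hidden quality scores.
    A real setup would prompt a judge model here instead.
    """
    return (a, b) if quality[a] >= quality[b] else (b, a)


def online_dpo(quality, steps=2000, beta=1.0, lr=0.1, seed=0):
    rng = random.Random(seed)
    k = len(quality)
    theta = [0.0] * k      # logits of the current policy
    theta_ref = [0.0] * k  # logits of the frozen reference policy
    for _ in range(steps):
        probs = softmax(theta)
        # On-policy step: both responses come from the CURRENT policy.
        a, b = rng.choices(range(k), weights=probs, k=2)
        if a == b:
            continue  # identical responses carry no preference signal
        w, l = toy_judge(a, b, quality)
        # DPO margin; the softmax normalizer cancels in the difference.
        h = (theta[w] - theta_ref[w]) - (theta[l] - theta_ref[l])
        # Gradient step on -log(sigmoid(beta * h)) with respect to the logits.
        g = lr * beta * (1.0 - sigmoid(beta * h))
        theta[w] += g
        theta[l] -= g
    return theta, softmax(theta)


theta, probs = online_dpo([0.0, 1.0, 2.0, 3.0])
```

The offline variant would differ only in where `a` and `b` come from: a fixed dataset sampled once from some other model, rather than a fresh draw from `probs` on every step.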

Two keepers. First, the online vs offline distinction matters more than the choice among DAP variants: OAIF improves DPO, IPO, and SLiC alike, isolating on-policy feedback as the lever. Second, the AI annotator's feedback is controllable via instruction prompts — you can steer the alignment target by changing how you ask the judge to choose.
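Both keepers can be made concrete with a short sketch. The three loss functions below use the forms commonly given for DPO, IPO, and SLiC, written as functions of the same preference margin `h` (the winner-minus-loser log-ratio against the reference policy); the exact parameterization and the hyperparameter names `beta` and `tau` are assumptions here, not taken from the note. `build_judge_prompt` is a hypothetical helper showing how the alignment target could be steered by changing only the instruction given to the judge.

```python
import math


def dpo_loss(h, beta=0.1):
    """-log(sigmoid(beta * h)), written with log1p for stability."""
    return math.log1p(math.exp(-beta * h))


def ipo_loss(h, tau=0.1):
    """Squared distance between the margin and the target 1 / (2 * tau)."""
    return (h - 1.0 / (2.0 * tau)) ** 2


def slic_loss(h, beta=0.1):
    """Hinge loss on the scaled margin."""
    return max(0.0, 1.0 - beta * h)


def build_judge_prompt(prompt, resp_a, resp_b,
                       instruction="Pick the more helpful response."):
    """Hypothetical judge prompt; only `instruction` changes the target."""
    return (
        f"{instruction}\n\n"
        f"Prompt: {prompt}\n\n"
        f"Response A: {resp_a}\n\n"
        f"Response B: {resp_b}\n\n"
        "Answer with A or B."
    )


# Same sampled pair, two different alignment targets.
helpful = build_judge_prompt("Explain DPO.", "short answer", "long answer")
concise = build_judge_prompt("Explain DPO.", "short answer", "long answer",
                             instruction="Pick the shorter response.")
```

Because all three losses consume the same margin, swapping one for another leaves the online sampling and judging steps unchanged, which is consistent with the note's point that the on-policy feedback, not the loss variant, is the lever.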

This connects the vault's alignment-method thread to the LLM-as-judge thread. The controllable AI annotator inherits the risks documented in "Can LLM judges be fooled by fake credentials and formatting?": an online judge that is biased steers the policy toward those biases. The on-policy framing also rhymes with "Can agents learn from failure without updating their weights?" in treating fresh, current-model feedback as the signal that matters.

Original note title

online AI feedback makes direct preference optimization on-policy — sampling from the current model and judging with an LLM beats offline DPO and RLHF