SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can environment feedback replace scalar rewards in policy learning?

Can rich tokenized feedback from environments serve as a direct learning signal for policies, without relying on compressed scalar rewards? This matters because scalar rewards discard information needed for credit assignment.

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

The central limitation of reinforcement learning with verifiable rewards (RLVR) is information-theoretic. The reward is one scalar per rollout, yet in many verifiable settings the environment produces a far richer signal: runtime errors, failing unit tests, judge evaluations, compile traces. RLVR collapses all of this to a single number. That scalar bottleneck creates the credit-assignment problem: which tokens caused the failure? The reward alone cannot say.
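A minimal sketch of the bottleneck, not taken from the SDPO paper: the `Rollout` class, the two code snippets, and the feedback strings are hypothetical examples. Two rollouts fail for different reasons, and the scalar reward cannot tell them apart while the environment text can.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    code: str       # what the policy generated
    feedback: str   # what the environment printed (hypothetical examples)
    passed: bool    # the verifier's verdict


def scalar_reward(rollout: Rollout) -> float:
    """RLVR-style reward: a single number per rollout."""
    return 1.0 if rollout.passed else 0.0


rollouts = [
    # Logic error: wrong operator.
    Rollout("def add(a, b): return a - b",
            "AssertionError: add(2, 3) returned -1, expected 5", False),
    # Syntax error: missing colon.
    Rollout("def add(a, b) return a + b",
            "SyntaxError: expected ':' (line 1)", False),
]

rewards = [scalar_reward(r) for r in rollouts]
distinct_rewards = len(set(rewards))
distinct_feedback = len({r.feedback for r in rollouts})

print(rewards)            # both failures map to the same number
print(distinct_rewards)   # how many different reward values were seen
print(distinct_feedback)  # how many different feedback strings were seen
```

Both failures receive the same reward, so any difference in how the policy should be corrected has to come from somewhere other than the scalar.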

Self-Distillation Policy Optimization (SDPO, 2601.20802) introduces a different paradigm: Reinforcement Learning with Rich Feedback (RLRF). Tokenized environment feedback is the supervision signal. The conversion mechanism is elegant: the current policy conditioned on the feedback serves as the self-teacher. Its next-token distribution is what the policy "would have generated" had it known the feedback in advance. SDPO distills this feedback-informed distribution back into the unconditioned policy.
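The distillation step can be sketched on toy numbers. This is an illustration under stated assumptions, not the paper's implementation: the note does not specify the loss, so a per-token KL(teacher || student) is assumed here, and the logits are hand-written stand-ins for "the policy with feedback in context" (teacher) and "the policy without it" (student).

```python
import math


def softmax(logits):
    """Numerically stable softmax over one vocabulary-sized list."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def per_token_kl(teacher_logits, student_logits):
    """One KL value per sequence position (assumed form of the loss)."""
    return [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]


# Toy sequence: 4 positions, vocabulary of 3 tokens.
# Student: the policy without feedback (uniform at every position).
student_logits = [[0.0, 0.0, 0.0] for _ in range(4)]

# Teacher: the same policy with feedback in context. In this toy case the
# feedback only changes its mind at position 2.
teacher_logits = [row[:] for row in student_logits]
teacher_logits[2] = [4.0, 0.0, 0.0]

kl = per_token_kl(teacher_logits, student_logits)
loss = sum(kl) / len(kl)  # mean over positions; the averaging is an assumption
blamed_position = max(range(len(kl)), key=kl.__getitem__)

print(kl)
print(blamed_position)
```

The per-position values are the point of the sketch: positions where the feedback changes nothing contribute zero, and the position where the teacher disagrees carries the whole loss, which is the dense credit assignment that a single scalar cannot provide.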

The trick is that no external teacher is required. Distillation usually needs a stronger model. SDPO leverages a different fact: the same model, when given retrospective evidence of its mistakes in-context, can identify what it should have done. The model is implicitly a process reward model — through retrospection — if given rich feedback. The student is bootstrapped by repeatedly imitating an improved version of itself, where "improved" means "conditioned on richer information."

The mechanism connects directly to the note "Can agents learn from failure without updating their weights?". Reflexion converts environment feedback into stored verbal reflections used at the next rollout. SDPO converts environment feedback into gradient-distilled improvements to the policy weights. Both reject the scalar reward as load-bearing; both treat environment signal as already containing the teaching. SDPO is the parameter-updating analog of Reflexion's memory-updating mechanism.

A second connection is structural: this is in-context learning used as supervision. Since the model can integrate feedback in-context, the difference between the with-feedback and without-feedback distributions IS the gradient signal. The policy doesn't need to discover what to do — it needs to internalize what its with-feedback self already knows.
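The claim that the gap between the two distributions is the gradient signal can be checked numerically in one specific setting. The sketch below assumes a KL(teacher || student) loss with the teacher held fixed (no gradient flows through it); under that assumption the gradient with respect to the student's logits is exactly student probabilities minus teacher probabilities. The logit values are arbitrary toy numbers.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_teacher_student(teacher_probs, student_logits):
    """KL(teacher || student) at one position, teacher treated as constant."""
    student_probs = softmax(student_logits)
    return sum(
        t * math.log(t / s)
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )


def numeric_grad(teacher_probs, student_logits, eps=1e-6):
    """Central finite differences with respect to each student logit."""
    grad = []
    for i in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[i] += eps
        down[i] -= eps
        diff = (kl_teacher_student(teacher_probs, up)
                - kl_teacher_student(teacher_probs, down))
        grad.append(diff / (2 * eps))
    return grad


teacher_probs = softmax([2.0, 0.5, -1.0])  # with-feedback distribution (toy)
student_logits = [0.3, 1.2, -0.4]          # without-feedback logits (toy)

# Analytic gradient: the difference between the two distributions.
analytic = [s - t for s, t in zip(softmax(student_logits), teacher_probs)]
numeric = numeric_grad(teacher_probs, student_logits)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))

print(analytic)
print(numeric)
print(max_gap)
```

A gradient step therefore moves the student's distribution toward the teacher's at every position where they differ, and does nothing where they already agree.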

The implication for the broader RL landscape: any language model is implicitly a process reward model (PRM) through retrospection. A separate reward model stops being load-bearing when rich tokenized feedback is available.

Original note title

rich tokenized environment feedback can be converted to dense credit assignment via self-distillation — the policy conditioned on feedback is its own teacher