SYNTHESIS NOTE
Training, RL, and Test-Time Scaling

Can rubrics and dense rewards work together without hacking?

Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.

Synthesis note · 2026-05-18 · sourced from Reasoning Methods CoT ToT
What actually changes inside a model during RL training? How well do reward models actually evaluate AI reasoning?

A familiar temptation when running RL on unverifiable tasks: take a rubric that says "good answers do X, Y, Z," score every rollout against it, and treat the score as a dense reward. DRO argues this is exactly the wrong move. The underlying problem is real: token-level dense rewards alone are vulnerable to reward hacking, because a rollout group can consist of uniformly low-quality answers that still differ from one another under the token-level metric, and those relative differences mislead the gradient. Rubrics supply the supervision that catches this case. But converting rubric judgments into dense rewards is brittle: rubric scores are noisy, gameable, and discontinuous in ways that dense gradients amplify.
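The "uniformly low-quality group" failure can be illustrated with a small sketch. It assumes a group-relative advantage of the mean-and-standard-deviation kind, which is an assumption made for illustration and not something the note attributes to DRO; the function name and the reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    # Assumed normalization: subtract the group mean, divide by the group std.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical dense scores for four rollouts that are all poor answers.
low_quality_group = [0.10, 0.12, 0.08, 0.11]
advantages = group_relative_advantages(low_quality_group)

# Normalization stretches tiny absolute differences into a full-sized signal:
# the "best" bad answer gets a clearly positive advantage.
best_advantage = max(advantages)
```

Under this assumed normalization, the optimizer is pushed toward the least bad rollout even though no rollout in the group is acceptable, which is the situation a rubric check is meant to catch.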

The architectural alternative is to use rubrics as gates rather than as rewards. A rollout group is accepted or rejected based on whether it meets essential task criteria. Rollouts that fail are dropped — they do not contribute to the gradient at all. Rollouts that pass go forward to the token-level dense reward. The two signals serve different functions: the rubric defines feasibility (a hard boundary on what counts as a valid answer); the dense reward defines optimization direction (how to improve among valid answers).
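A minimal sketch of the gate-then-optimize flow described above, assuming the gate is applied per rollout and that the dense reward is available as one number per token. The `Rollout` type, the `cites_source` criterion, and the sample data are hypothetical and are not taken from DRO.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    text: str
    token_rewards: List[float]  # one dense reward per generated token


def cites_source(text: str) -> bool:
    # Hypothetical rubric criterion: the answer contains a bracketed citation.
    return "[" in text and "]" in text


def gate_rollouts(rollouts: List[Rollout],
                  criteria: List[Callable[[str], bool]]) -> List[Rollout]:
    # Feasibility: keep only rollouts that satisfy every essential criterion.
    return [r for r in rollouts if all(check(r.text) for check in criteria)]


def dense_scores(rollouts: List[Rollout]) -> List[float]:
    # Optimization direction: summed token-level reward, survivors only.
    return [sum(r.token_rewards) for r in rollouts]


rollouts = [
    Rollout("Water boils at 100 C at sea level [1].", [0.2, 0.3, 0.1]),
    Rollout("Water boils when it gets hot enough.", [0.4, 0.4, 0.4]),
]
survivors = gate_rollouts(rollouts, [cites_source])
scores = dense_scores(survivors)
```

The second rollout has the higher dense score but never reaches the optimizer, because the rubric decides only whether a rollout is eligible and says nothing about how much it is worth.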

The separation matters because the two signals have different statistical properties. Rubric judgments are good at hard accept/reject decisions ("does this answer cite a source?") and bad at dense gradient supervision ("how much better is answer A than answer B at citing sources?"). Dense rewards are good at fine-grained gradient supervision and bad at hard constraints. Each does what it does well; mixing them inherits the failure modes of both.

The principle generalizes beyond DRO. Whenever an RL setup has both a fine-grained quality signal and a categorical correctness signal, treating the categorical signal as a multiplicative gate rather than as an additive reward preserves its categorical nature and prevents the dense optimizer from finding loopholes in the categorical judgment.
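The difference between an additive rubric bonus and a multiplicative gate can be made concrete with a toy comparison. This is a sketch under stated assumptions: dense scores are non-negative, the rubric outcome is a single pass/fail bit, and the function names, the bonus weight of 0.5, and the example numbers are all hypothetical.

```python
def additive_reward(dense: float, rubric_pass: bool, weight: float = 0.5) -> float:
    # Categorical signal folded in as a bonus: it can be outbid by dense score.
    return dense + weight * (1.0 if rubric_pass else 0.0)


def gated_reward(dense: float, rubric_pass: bool) -> float:
    # Categorical signal as a 0/1 multiplier: failing rollouts earn nothing.
    return dense * (1.0 if rubric_pass else 0.0)


failing = (0.9, False)  # high dense score, fails the rubric
passing = (0.3, True)   # modest dense score, passes the rubric

additive_prefers_failing = additive_reward(*failing) > additive_reward(*passing)
gated_prefers_passing = gated_reward(*passing) > gated_reward(*failing)
```

With the additive form, a large enough dense score buys its way past the rubric; with the gate, no dense score can. A zero reward is not identical to dropping the rollout from the gradient altogether, so this sketch illustrates the ordering argument only.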

Original note title

separating optimization from feasibility — dense token-level rewards plus rubric hard-gates on final answers — prevents the reward hacking that pure rubric-derived rewards invite