SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Agentic Systems and Tool Use

How can rubric-based rewards resist reward hacking attacks?

Single rubrics are easily exploited by models, and simply adding more rubrics yields diminishing returns. What design patterns and defensive mechanisms actually prevent reward hacking in rubric-based RL systems?

Synthesis note · 2026-02-22 · sourced from RLVR
How do domain training techniques actually reshape model behavior? How should researchers navigate LLM reasoning research? What does reward learning actually do to model reasoning?

Extending RLVR beyond verifiable domains with rubric-based rewards runs into a practical constraint: success depends on how the rubrics are designed, not on the rubric concept alone. Models quickly learn to exploit a single rubric. Indiscriminately scaling rubric quantity, whether the rubrics are human-written or LLM-generated, yields only marginal gains. The effective path requires careful engineering across multiple dimensions.

The Rubric Anchors framework constructs over 10,000 rubrics spanning multiple scopes (dataset-level, task-level, instance-specific) and generation methods (human experts, self-critique models, powerful teacher APIs, hybrid). Extensive ablation reveals that success requires specific combinations of diversity, granularity, and quantity.
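
To make the scope and generation-method axes concrete, here is a minimal sketch of how such a rubric bank could be represented and queried. It is not the Rubric Anchors implementation: the `Rubric` fields, the `applicable_rubrics` helper, and the example rubric texts are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Scope(Enum):
    DATASET = "dataset"
    TASK = "task"
    INSTANCE = "instance"


class Source(Enum):
    HUMAN = "human"
    SELF_CRITIQUE = "self_critique"
    TEACHER_API = "teacher_api"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class Rubric:
    # Hypothetical schema: one rubric tagged with its scope and how it was generated.
    text: str
    scope: Scope
    source: Source
    dataset: str
    task: Optional[str] = None         # set for task- and instance-level rubrics
    instance_id: Optional[str] = None  # set for instance-level rubrics only


def applicable_rubrics(bank: List[Rubric], dataset: str, task: str,
                       instance_id: str) -> List[Rubric]:
    """Return the rubrics that apply to one training instance, broadest scope first."""
    selected = []
    for r in bank:
        if r.dataset != dataset:
            continue
        if r.scope is Scope.DATASET:
            selected.append(r)
        elif r.scope is Scope.TASK and r.task == task:
            selected.append(r)
        elif r.scope is Scope.INSTANCE and r.instance_id == instance_id:
            selected.append(r)
    order = {Scope.DATASET: 0, Scope.TASK: 1, Scope.INSTANCE: 2}
    return sorted(selected, key=lambda r: order[r.scope])


# Illustrative bank; the rubric texts are made up for this example.
BANK = [
    Rubric("Do not fabricate citations.", Scope.DATASET, Source.HUMAN, "writing"),
    Rubric("Stay under 200 words.", Scope.TASK, Source.TEACHER_API, "writing",
           task="summary"),
    Rubric("Mention the stated deadline.", Scope.INSTANCE, Source.HYBRID, "writing",
           task="summary", instance_id="ex-7"),
    Rubric("Follow an ABAB rhyme scheme.", Scope.TASK, Source.SELF_CRITIQUE, "writing",
           task="poem"),
]

selected = applicable_rubrics(BANK, "writing", "summary", "ex-7")
```

Tagging each rubric with both scope and source is what would make ablations over diversity, granularity, and quantity possible in a setup like this, since subsets can be filtered along either axis.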

Four architectural mechanisms prove essential:

A "seesaw effect" emerges during training: jointly training on different task types (strict constraint-following vs. open-ended creativity) often reduces overall performance because the optimization objectives conflict. Stage-wise RL scheduling, which builds constraint-handling foundations before extending to creative tasks, is a pragmatic mitigation.
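
A stage-wise schedule of this kind can be sketched as a list of stages, each with its own task mix. The stage names, step counts, and mixing weights below are assumptions made for illustration, including the choice to keep some constraint-following data in the second stage; the note only states the ordering (constraints first, creative tasks later).

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    steps: int                      # number of RL steps spent in this stage
    task_weights: Dict[str, float]  # sampling weight per task type


# Hypothetical two-stage schedule: constraint-following first, creative tasks second.
SCHEDULE: List[Stage] = [
    Stage("constraints", 1000, {"constraint_following": 1.0}),
    # Assumption: a small share of constraint data is replayed in stage 2.
    Stage("creative", 2000, {"constraint_following": 0.3, "creative": 0.7}),
]


def stage_at(schedule: List[Stage], step: int) -> Stage:
    """Return the stage that is active at a given global training step."""
    if step < 0:
        raise ValueError("step must be non-negative")
    boundary = 0
    for stage in schedule:
        boundary += stage.steps
        if step < boundary:
            return stage
    return schedule[-1]  # remain in the final stage once the schedule is exhausted


def sample_task(stage: Stage, rng: random.Random) -> str:
    """Draw one task type according to the stage's mixing weights."""
    names = list(stage.task_weights)
    weights = [stage.task_weights[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

During stage 1 every sampled batch is constraint-following, so the conflicting creative objective is simply absent until the model has a stable foundation.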

Defense against reward hacking is adaptive and iterative: offline analysis of rollout data identifies recurring exploitation patterns, and these inform a dedicated Reward Hacking Defense Rubric that is integrated as a supervisory constraint in subsequent training stages. This markedly improves training stability and allows training to remain productive for more epochs.
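
The loop described above (analyze rollouts offline, then constrain the reward in the next stage) can be sketched as follows. This is a simplified stand-in, not the note's method: regular expressions substitute for whatever judge actually scores the defense rubric, the two pattern names are hypothetical examples of exploitation, and zeroing the reward is one assumed way to apply a supervisory constraint.

```python
import re
from collections import Counter
from typing import List

# Hypothetical exploitation signatures; a real system would derive these
# from inspection of its own rollout data.
CANDIDATE_PATTERNS = {
    "self_praise": re.compile(
        r"\bthis (response|answer) (fully|perfectly) (satisfies|meets)", re.I),
    "rubric_echo": re.compile(
        r"\b(as required by the rubric|according to the scoring criteria)\b", re.I),
}


def find_recurring_patterns(rollouts: List[str], min_rate: float = 0.2) -> List[str]:
    """Offline step: names of patterns that fire in at least min_rate of rollouts."""
    counts = Counter()
    for text in rollouts:
        for name, pattern in CANDIDATE_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    n = max(len(rollouts), 1)
    return sorted(name for name, c in counts.items() if c / n >= min_rate)


def defended_reward(text: str, base_reward: float, active_patterns: List[str]) -> float:
    """Next-stage step: zero the reward when any flagged pattern is present (assumed policy)."""
    for name in active_patterns:
        if CANDIDATE_PATTERNS[name].search(text):
            return 0.0
    return base_reward


# Made-up rollouts from a previous stage, for illustration only.
ROLLOUTS = [
    "This response fully satisfies every criterion.",
    "Here is a short poem about rain.",
    "This answer perfectly meets the brief. As required by the rubric, it is concise.",
    "A plain summary of the article.",
]

flagged = find_recurring_patterns(ROLLOUTS, min_rate=0.3)
```

Raising `min_rate` keeps the defense focused on patterns that actually recur, which matches the note's emphasis on recurring exploitation rather than one-off oddities.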

Original note title

rubric-based rl requires adaptive defense against reward hacking — single rubrics are exploitable while indiscriminate scaling yields marginal gains