SYNTHESIS NOTE
Psychology, Society, and Alignment · Training, RL, and Test-Time Scaling · Model Architecture and Internals

Can counterfactual invariance eliminate reward hacking biases?

Does forcing reward models to remain consistent under irrelevant changes remove the spurious correlations that cause length bias, sycophancy, concept bias, and discrimination? This matters because standard training bakes these biases in permanently.

Synthesis note · 2026-02-22 · sourced from Reward Models

Reward hacking is not one problem but four, each stemming from a different spurious correlation in the training data:

  1. Length bias — the model learns that longer outputs receive higher rewards, regardless of content quality. The correlation between length and human preference exists in training data but is not causal.
  2. Sycophancy bias — the model learns to agree with user assertions, even incorrect ones, because agreeable responses correlate with higher preference ratings.
  3. Concept bias — the model develops unintended shortcuts when making predictions, learning surface-level concept associations rather than genuine quality assessment.
  4. Discrimination bias — the model implicitly develops preferences correlated with demographic features in the training data.
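
As a rough illustration of the first item, the sketch below measures how strongly a reward function's scores track response length. Everything here is hypothetical: `toy_length_biased_reward` stands in for a trained reward model, and the three responses are made-up examples. A high correlation is only a warning sign, since length can legitimately accompany quality; it does not by itself show the association is spurious.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5

def length_reward_correlation(responses, reward_fn):
    """Correlate word count with the reward assigned to each response."""
    lengths = [len(r.split()) for r in responses]
    rewards = [reward_fn(r) for r in responses]
    return pearson(lengths, rewards)

def toy_length_biased_reward(response):
    # Hypothetical stand-in for a reward model: it scores word count only.
    return 0.1 * len(response.split())

# Three made-up answers with the same content and increasing length.
responses = [
    "Paris.",
    "The capital is Paris.",
    "The capital of France is, as far as I know, Paris.",
]
corr = length_reward_correlation(responses, toy_length_biased_reward)
print(f"length/reward correlation: {corr:.3f}")
```

The same probe could in principle be pointed at agreement markers or demographic terms instead of word count, though each of those needs its own feature extractor.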

Standard reward model training (Bradley-Terry MLE) cannot distinguish causal from spurious associations. The objective only maximizes the margin between chosen and rejected responses, so any spurious feature that happens to correlate with preference gets baked in. As the linked note "Do reward models actually consider what the prompt asks?" argues, the model is already learning response-level biases rather than prompt-aligned preferences; spurious correlations compound this.
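
To make the margin argument concrete, here is a minimal sketch of the per-pair Bradley-Terry negative log-likelihood, -log sigmoid(r_chosen - r_rejected). The two-feature linear reward and the feature values (quality, length) are assumptions chosen for illustration, not taken from the source. The point is that the loss sees only the margin, so putting weight on a correlated but irrelevant feature lowers it just as well as weight on quality would.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one preference pair under Bradley-Terry."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def linear_reward(features, weights):
    """Hypothetical linear reward model: dot product of features and weights."""
    return sum(w * f for w, f in zip(weights, features))

# Hypothetical features: (quality, length). In this pair the chosen response
# is both better and longer, as is often the case in preference data.
chosen = (1.0, 3.0)
rejected = (0.0, 1.0)

quality_only = (1.0, 0.0)   # weights that ignore length
with_length = (1.0, 0.5)    # weights that also reward length

loss_quality_only = bt_loss(linear_reward(chosen, quality_only),
                            linear_reward(rejected, quality_only))
loss_with_length = bt_loss(linear_reward(chosen, with_length),
                           linear_reward(rejected, with_length))

# Rewarding length widens the margin, so the loss drops.
print(loss_quality_only, loss_with_length)
```

Nothing in this objective penalizes the second weight vector, which is why maximum likelihood training has no reason to prefer the causal feature over the correlated one.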

The Causal Reward Model (CRM) applies counterfactual invariance: reward predictions must remain consistent under interventions on irrelevant aspects of the input. If altering response length, tone of agreement, or demographic signals changes the reward without changing actual quality, the model has learned a spurious feature. The counterfactual invariance constraint forces the model to isolate the causal features — the ones that actually determine quality.
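
One simple way to express such a constraint is a penalty on the reward gap between a response and a counterfactual copy that differs only in an irrelevant attribute. The sketch below is an illustration under stated assumptions, not the CRM's actual regularizer, which this note does not specify. The lookup table `ANSWERS`, both toy reward functions, and the padded counterfactual responses are all hypothetical.

```python
# Hypothetical ground truth used by the toy reward functions below.
ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is 2 + 2?": "4",
}

def content_reward(prompt, response):
    """Toy reward that depends only on whether the correct answer appears."""
    return 1.0 if ANSWERS[prompt] in response else 0.0

def length_biased_reward(prompt, response):
    """Toy reward with a spurious length term added on top of content."""
    return content_reward(prompt, response) + 0.1 * len(response.split())

def invariance_penalty(reward_fn, pairs):
    """Mean squared reward gap over (prompt, original, counterfactual) triples.

    Each counterfactual is assumed to change only an irrelevant attribute
    (here: padding), so an invariant reward should give a gap of zero.
    """
    gaps = [(reward_fn(p, a) - reward_fn(p, b)) ** 2 for p, a, b in pairs]
    return sum(gaps) / len(gaps)

# Same answer, different length: an intervention on an irrelevant attribute.
pairs = [
    ("What is the capital of France?", "Paris.", "Well, the answer is Paris."),
    ("What is 2 + 2?", "4", "The answer to that is 4"),
]

penalty_invariant = invariance_penalty(content_reward, pairs)
penalty_biased = invariance_penalty(length_biased_reward, pairs)
print(penalty_invariant, penalty_biased)
```

In training, a term like this would typically be added to the preference loss with a weighting coefficient, so the model is pushed toward features that survive the intervention. The hard part in practice is constructing counterfactuals that really do hold quality fixed, which this toy example sidesteps.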

This connects to the broader pattern described in the linked note "Does transformer attention architecture inherently favor repeated content?": sycophancy has both an attention-level and a reward-model-level component. Fixing the reward model alone is insufficient if the attention mechanism also biases toward agreement; fixing attention alone is insufficient if the reward model reinforces the bias.


Original note title

causal reward modeling via counterfactual invariance addresses four distinct reward hacking biases that standard training cannot eliminate