SYNTHESIS NOTE

Do reward models actually consider what the prompt asks?

This note explores whether standard reward models evaluate responses in light of the prompt or on response quality alone. This matters because a reward model that ignores the prompt will fail to align with what users actually want.

Synthesis note · 2026-02-22 · sourced from Reward Models

Standard reward model training (maximum-likelihood estimation under the Bradley-Terry model) does not force the model to consider prompts. Because different training samples typically contain distinct response pairs, the model can learn to separate chosen from rejected responses using response features alone, effectively ignoring the prompt. The evidence: the reward gap between chosen and rejected responses stays centered on roughly the same values even after the original prompt is replaced with a different one.
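The prompt-replacement check described above can be sketched in a few lines. This is a minimal illustration, not the procedure from any particular paper: both reward functions, the toy samples, and the rotate-by-one replacement scheme are assumptions made up for the example. A prompt-blind reward should show an unchanged gap after replacement, while a prompt-aware reward should show a smaller one.

```python
from statistics import mean

def length_biased_reward(prompt: str, response: str) -> float:
    """Hypothetical prompt-blind reward: scores response length only."""
    return float(len(response.split()))

def overlap_reward(prompt: str, response: str) -> float:
    """Hypothetical prompt-aware reward: counts words shared with the prompt."""
    return float(len(set(prompt.lower().split()) & set(response.lower().split())))

def mean_reward_gap(reward_fn, samples):
    """Average of r(prompt, chosen) - r(prompt, rejected) over all samples."""
    return mean(reward_fn(p, c) - reward_fn(p, r) for p, c, r in samples)

def replace_prompts(samples):
    """Give each response pair the prompt of the next sample (rotation by one)."""
    prompts = [p for p, _, _ in samples]
    rotated = prompts[1:] + prompts[:1]
    return [(q, c, r) for q, (_, c, r) in zip(rotated, samples)]

# Toy (prompt, chosen, rejected) triples, invented for illustration.
samples = [
    ("capital of France", "Paris is the capital of France",
     "I like long detailed answers a lot"),
    ("boiling point of water", "water has a boiling point of 100 C",
     "Paris is nice"),
    ("author of Hamlet", "Hamlet has Shakespeare as author", "no idea"),
]

swapped = replace_prompts(samples)
for name, fn in (("length-biased", length_biased_reward),
                 ("overlap", overlap_reward)):
    print(name, mean_reward_gap(fn, samples), mean_reward_gap(fn, swapped))
```

If a trained reward model behaves like the length-biased toy here, with a gap that barely moves under replacement, that is the symptom this note describes.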

This is not a minor calibration issue; it is a structural failure. When the reward model learns only response-level biases (e.g., "longer is better," "confident tone is better"), it cannot generalize to novel prompt-response pairs. A well-written response will receive high reward regardless of whether it actually answers the question. RLHF then optimizes against phantom quality signals: response biases masquerading as prompt alignment.

The Information-Theoretic Reward Decomposition approach (Li et al., 2025) splits the reward into two components without requiring extra models: prompt-free reward (determined solely by response features) and prompt-related reward (derived from the prompt-response interaction). The prompt-free component exposes the model's bias — it reflects preference that has nothing to do with the prompt. The prompt-related component captures genuine alignment between prompt and response.
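To make the two components concrete, here is a rough sketch of one way such a split could be approximated. It is an assumption for illustration only, not the estimator used by Li et al.: the prompt-free part is taken as the average reward of the response across a set of unrelated reference prompts, and the prompt-related part is the residual. The names `decompose_reward`, `toy_reward`, and `length_only_reward` are hypothetical.

```python
from statistics import mean

def decompose_reward(reward_fn, prompt, response, reference_prompts):
    """Split r(prompt, response) into (prompt_free, prompt_related).

    Assumption: prompt_free is approximated by averaging the reward of the
    same response over unrelated reference prompts; prompt_related is the
    remainder. This is an illustrative estimator only.
    """
    total = reward_fn(prompt, response)
    prompt_free = mean(reward_fn(q, response) for q in reference_prompts)
    return prompt_free, total - prompt_free

def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward mixing a length bias with prompt-word overlap."""
    shared = set(prompt.lower().split()) & set(response.lower().split())
    return 0.5 * len(response.split()) + 2.0 * len(shared)

def length_only_reward(prompt: str, response: str) -> float:
    """Hypothetical reward that ignores the prompt entirely."""
    return 0.5 * len(response.split())

reference_prompts = ["boiling point of water", "author of Hamlet"]
prompt = "capital of France"
response = "Paris is the capital of France"

free, related = decompose_reward(toy_reward, prompt, response, reference_prompts)
print("prompt-free:", free, "prompt-related:", related)
```

Under this approximation, a reward model that ignores the prompt yields a prompt-related component of zero for every response, which is what makes the prompt-free component a readable measure of bias.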

The practical fix: prioritize training samples where the prompt-related reward gap is large relative to the prompt-free reward gap. This focuses learning on samples where the prompt actually matters for the preference, rather than samples where response quality alone determines the outcome.
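As a sketch of what this prioritization could look like, the snippet below ranks samples by the share of the preference gap attributable to the prompt. The weighting formula, the field names `related_gap` and `free_gap`, and the toy values are all assumptions chosen for illustration; the source does not specify how the two gaps are compared.

```python
def prompt_sensitivity_weight(related_gap: float, free_gap: float,
                              eps: float = 1e-8) -> float:
    """Hypothetical weight in [0, 1): the share of the total gap magnitude
    that comes from the prompt-related component."""
    related, free = abs(related_gap), abs(free_gap)
    return related / (related + free + eps)

def prioritize(samples):
    """Sort samples so those where the prompt matters most come first."""
    return sorted(
        samples,
        key=lambda s: prompt_sensitivity_weight(s["related_gap"], s["free_gap"]),
        reverse=True,
    )

# Toy per-sample gaps (chosen minus rejected), invented for illustration.
toy_samples = [
    {"id": "a", "related_gap": 2.0, "free_gap": 0.5},
    {"id": "b", "related_gap": 0.1, "free_gap": 3.0},
    {"id": "c", "related_gap": 1.0, "free_gap": 1.0},
]

for s in prioritize(toy_samples):
    w = prompt_sensitivity_weight(s["related_gap"], s["free_gap"])
    print(s["id"], round(w, 3))
```

The same weight could instead scale each sample's loss term rather than set its order; which of the two is preferable is a design choice this note does not settle.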

The finding connects to a broader pattern. In "Can LLM judges be fooled by fake credentials and formatting?", prompt-insensitivity is a specific mechanism underlying why judge evaluations fail. It also parallels "Can LLM explanations actually help humans predict model behavior?": in both cases, what looks like quality evaluation is actually decoupled from the semantically relevant signal.

Lines of inquiry that draw on this note:

What properties determine whether reward signals teach genuine reasoning?

Related concepts in this collection 3

Can LLM judges be fooled by fake credentials and formatting?
Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
Connection: prompt-insensitivity is a mechanism underlying exploitable judge biases.

Can LLM explanations actually help humans predict model behavior?
Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
Connection: parallel decoupling. Explanation quality is decoupled from actual explanation precision, just as reward quality is decoupled from actual prompt relevance.

Does transformer attention architecture inherently favor repeated content?
Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
Connection: response-level bias may compound with attention-level bias.
