Do reward models actually consider what the prompt asks?
Exploring whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want.
Standard reward model training (Bradley-Terry MLE) does not force the model to consider prompts. Since different training samples typically contain distinct response pairs, the model can learn to distinguish chosen from rejected responses based on response features alone, effectively ignoring the prompt. The symptom is visible in a prompt-replacement test: when the original prompt is swapped for a different one, the reward gap between the chosen and rejected responses stays centered around the same values.
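The prompt-replacement test described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: `response_only_reward` is a hypothetical stand-in for a reward model that has learned only a length bias, and `gap_after_prompt_swap` is an assumed helper name. The Bradley-Terry loss is included to show that it depends only on the reward gap, so a prompt-blind model can still drive it down.

```python
import math


def bradley_terry_nll(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of preferring `chosen` under Bradley-Terry.

    Equals -log(sigmoid(r_chosen - r_rejected)), computed in a stable form.
    """
    gap = r_chosen - r_rejected
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))


def response_only_reward(prompt: str, response: str) -> float:
    """Hypothetical biased reward: scores response length, never reads the prompt."""
    return 0.01 * len(response)


def gap_after_prompt_swap(reward_fn, prompt, other_prompt, chosen, rejected):
    """Return the chosen-minus-rejected reward gap before and after a prompt swap."""
    original = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
    swapped = reward_fn(other_prompt, chosen) - reward_fn(other_prompt, rejected)
    return original, swapped


if __name__ == "__main__":
    original, swapped = gap_after_prompt_swap(
        response_only_reward,
        prompt="What is 2 + 2?",
        other_prompt="Name a river in France.",
        chosen="The answer is 4, which follows from basic arithmetic.",
        rejected="4",
    )
    # For a prompt-blind reward the two gaps are identical by construction.
    print(original, swapped, bradley_terry_nll(original, 0.0))
```

With a real reward model, a gap that barely moves under the swap is the warning sign: the preference is being read off the responses alone.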
This is not a minor calibration issue; it is a structural failure. When the reward model only learns response-level biases (e.g., "longer is better," "confident tone is better"), it cannot generalize to novel prompt-response pairs. A response that happens to be well-written will receive high reward regardless of whether it actually answers the question. RLHF then optimizes toward phantom quality signals: response biases masquerading as prompt alignment.
The Information-Theoretic Reward Decomposition approach (Li et al., 2025) splits the reward into two components without requiring extra models: prompt-free reward (determined solely by response features) and prompt-related reward (derived from the prompt-response interaction). The prompt-free component exposes the model's bias — it reflects preference that has nothing to do with the prompt. The prompt-related component captures genuine alignment between prompt and response.
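To make the two components concrete, here is a rough sketch that approximates them for a single preference pair. It is an illustrative assumption, not the estimator from the paper: the prompt-free gap is taken as the average reward gap over a pool of unrelated replacement prompts, and the prompt-related gap is the residual. `decompose_gap`, `toy_reward`, and the example prompts are all hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

RewardFn = Callable[[str, str], float]


def decompose_gap(
    reward_fn: RewardFn,
    prompt: str,
    chosen: str,
    rejected: str,
    prompt_pool: Sequence[str],
) -> Tuple[float, float]:
    """Split a reward gap into (prompt_free_gap, prompt_related_gap).

    Assumption: averaging the gap over replacement prompts approximates the
    part of the preference that does not depend on the original prompt.
    The pool should not contain the original prompt.
    """
    full_gap = reward_fn(prompt, chosen) - reward_fn(prompt, rejected)
    prompt_free_gap = mean(
        reward_fn(p, chosen) - reward_fn(p, rejected) for p in prompt_pool
    )
    return prompt_free_gap, full_gap - prompt_free_gap


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward: a length bias plus word overlap with the prompt."""
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return 0.01 * len(response) + overlap


if __name__ == "__main__":
    free, related = decompose_gap(
        toy_reward,
        prompt="capital of france",
        chosen="paris is the capital of france",
        rejected="a very long and confident answer about something else entirely",
        prompt_pool=["how to bake bread", "explain quicksort"],
    )
    print(f"prompt-free gap: {free:.2f}, prompt-related gap: {related:.2f}")
```

In this toy case the prompt-free gap favours the longer rejected response, while the prompt-related gap favours the response that actually addresses the question.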
The practical fix: prioritize training samples where the prompt-related reward gap is large relative to the prompt-free reward gap. This focuses learning on samples where the prompt actually matters for the preference, rather than samples where response quality alone determines the outcome.
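One way to turn this prioritization into sampling weights is sketched below. The scoring rule (the prompt-related share of the total absolute gap) and the name `priority_weights` are assumptions chosen for illustration; the paper's exact weighting scheme is not reproduced here.

```python
from typing import List, Sequence, Tuple


def priority_weights(
    gaps: Sequence[Tuple[float, float]], eps: float = 1e-8
) -> List[float]:
    """Turn (prompt_free_gap, prompt_related_gap) pairs into sampling weights.

    Each sample is scored by how much of its total absolute gap is
    prompt-related; scores are then normalized to sum to 1.
    """
    scores = [abs(rel) / (abs(free) + abs(rel) + eps) for free, rel in gaps]
    total = sum(scores)
    if total == 0:
        # No sample carries prompt-related signal: fall back to uniform weights.
        return [1.0 / len(gaps)] * len(gaps)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Hypothetical gaps: the first sample is dominated by response bias,
    # the second by the prompt-response interaction.
    weights = priority_weights([(2.0, 0.1), (0.1, 2.0)])
    print(weights)
```

Samples whose preference is explained mostly by response features end up with small weights, so training spends more of its budget where the prompt decides the outcome.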
The finding connects to a broader pattern. It relates to the note "Can LLM judges be fooled by fake credentials and formatting?": prompt-insensitivity is a specific mechanism underlying why judge evaluations fail. It also parallels the note "Can LLM explanations actually help humans predict model behavior?": in both cases, what looks like quality evaluation is actually decoupled from the semantically relevant signal.
Inquiring lines that use this note as a source 8
This note is a source for these synthesized inquiries.
- Why do reward models trained for accuracy ignore important context about the input?
- How does prompt context decomposition reveal hidden reward model failures?
- Why do reward models fail when they ignore the prompt context?
- What four distinct biases emerge when reward models ignore the prompt?
- Do reward reasoning models with chain-of-thought reasoning evaluate prompts better?
- Can decomposing rewards into prompt-free and prompt-related components fix this blindspot?
- How do checklists prevent reward models from exploiting superficial response artifacts?
- Why do reward models fail to recognize genuinely different valid answers?
Related concepts in this collection 3
- Can LLM judges be fooled by fake credentials and formatting?
  Explores whether language models evaluating text fall for authority signals and visual presentation unrelated to actual content quality, and whether these weaknesses can be exploited without deep model knowledge.
  prompt-insensitivity is a mechanism underlying exploitable judge biases
- Can LLM explanations actually help humans predict model behavior?
  Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
  parallel decoupling: explanation quality decoupled from actual explanation precision, reward quality decoupled from actual prompt relevance
- Does transformer attention architecture inherently favor repeated content?
  Explores whether soft attention's tendency to over-weight repeated and prominent tokens explains sycophancy independent of training. Questions whether architectural bias precedes and enables RLHF effects.
  response-level bias may compound with attention-level bias
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Information-Theoretic Reward Decomposition for Generalizable RLHF
- Reward Reasoning Model
- ARGS: Alignment as Reward-Guided Search
- Checklists Are Better Than Reward Models For Aligning Language Models
- Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
- Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future
- RewardBench: Evaluating Reward Models for Language Modeling
- Simple Synthetic Data Reduces Sycophancy In Large Language Models
Original note title
reward models ignore prompt context when evaluating responses — decomposing into prompt-free and prompt-related components reveals and corrects the generalization failure