SYNTHESIS NOTE

Do reward models actually consider what the prompt asks?

This note examines whether standard reward models score a response in the context of its prompt or on response quality alone. The distinction matters: a reward model that ignores the prompt cannot tell whether a response does what the user actually asked.

Synthesis note · 2026-02-22 · sourced from Reward Models

Standard reward model training (maximum-likelihood estimation under the Bradley-Terry model) does not force the model to consider the prompt. Because each training sample typically contains a distinct response pair, the model can learn to separate chosen from rejected responses using response features alone, effectively ignoring the prompt. The evidence: the reward gap between chosen and rejected responses stays centered around the same values even after the original prompt is replaced with a different one.
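The prompt-replacement check described above can be sketched in a few lines. This is a minimal illustration, not the procedure from any particular paper: `prompt_swap_gaps`, `length_reward`, and the `(prompt, chosen, rejected)` tuple layout are all hypothetical names chosen for the example, and `length_reward` is a deliberately prompt-blind toy scorer standing in for a trained reward model.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar reward
Pair = Tuple[str, str, str]             # (prompt, chosen, rejected)


def prompt_swap_gaps(reward_fn: RewardFn, pairs: List[Pair], seed: int = 0) -> Tuple[float, float]:
    """Return (mean gap with original prompts, mean gap with swapped prompts)."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to swap prompts")
    rng = random.Random(seed)
    prompts = [prompt for prompt, _, _ in pairs]
    original, swapped = [], []
    for i, (prompt, chosen, rejected) in enumerate(pairs):
        # Pick a prompt that belongs to a different training sample.
        other = rng.choice(prompts[:i] + prompts[i + 1:])
        original.append(reward_fn(prompt, chosen) - reward_fn(prompt, rejected))
        swapped.append(reward_fn(other, chosen) - reward_fn(other, rejected))
    return mean(original), mean(swapped)


def length_reward(prompt: str, response: str) -> float:
    """Toy scorer (assumption): rewards word count and never reads the prompt."""
    return float(len(response.split()))
```

If the two returned means are nearly equal for a real reward model, as they are exactly equal for `length_reward`, the model's preferences are being driven by the responses rather than by the prompt-response match.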

This is not a minor calibration issue; it is a structural failure. When the reward model learns only response-level biases (e.g., "longer is better," "confident tone is better"), it cannot generalize to novel prompt-response pairs. A response that happens to be well-written will receive high reward regardless of whether it actually answers the question. RLHF then optimizes the policy toward phantom quality signals: response biases masquerading as prompt alignment.

The Information-Theoretic Reward Decomposition approach (Li et al., 2025) splits the reward into two components without requiring extra models: prompt-free reward (determined solely by response features) and prompt-related reward (derived from the prompt-response interaction). The prompt-free component exposes the model's bias — it reflects preference that has nothing to do with the prompt. The prompt-related component captures genuine alignment between prompt and response.
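To make the two components concrete, here is a simplified stand-in for the decomposition. It is not the information-theoretic estimator from Li et al.; it assumes, purely for illustration, that the prompt-free reward can be approximated by averaging the reward of a response over a pool of prompts, with the prompt-related reward taken as the residual. The names `decompose_reward` and `prompt_pool` are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar reward


def decompose_reward(
    reward_fn: RewardFn,
    prompt: str,
    response: str,
    prompt_pool: Sequence[str],
) -> Tuple[float, float]:
    """Split a reward into (prompt_free, prompt_related) parts.

    Assumption: prompt_free is the mean reward of `response` across
    `prompt_pool`; prompt_related is what remains for the actual prompt.
    """
    if not prompt_pool:
        raise ValueError("prompt_pool must not be empty")
    prompt_free = mean([reward_fn(p, response) for p in prompt_pool])
    prompt_related = reward_fn(prompt, response) - prompt_free
    return prompt_free, prompt_related
```

Under this approximation, a reward model that never reads the prompt yields a prompt-related component of zero for every response, which is the bias the decomposition is meant to expose.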

The practical fix: prioritize training samples where the prompt-related reward gap is large relative to the prompt-free reward gap. This focuses learning on samples where the prompt actually matters for the preference, rather than samples where response quality alone determines the outcome.
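One way to turn this prioritization into code is to score each sample by how much of its total gap comes from the prompt-related part. The ratio below is an illustrative choice, not a formula taken from the source, which only says to prioritize such samples; `priority_weight`, `rank_samples`, and the use of gap magnitudes are assumptions.

```python
from typing import List, Tuple

# (sample_id, prompt_related_gap, prompt_free_gap); layout is hypothetical.
GapRecord = Tuple[str, float, float]


def priority_weight(related_gap: float, free_gap: float, eps: float = 1e-8) -> float:
    """Share of the gap magnitude attributable to the prompt, in [0, 1)."""
    r, f = abs(related_gap), abs(free_gap)
    return r / (r + f + eps)  # eps avoids division by zero when both gaps are 0


def rank_samples(records: List[GapRecord]) -> List[GapRecord]:
    """Order samples so that prompt-dependent preferences come first."""
    return sorted(records, key=lambda rec: priority_weight(rec[1], rec[2]), reverse=True)
```

The same weights could instead scale each sample's loss during training; ranking is shown here only because it is the simplest way to inspect which samples would be emphasized.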

The finding connects to a broader pattern. It relates to the note "Can LLM judges be fooled by fake credentials and formatting?": prompt-insensitivity is a specific mechanism underlying why judge evaluations fail. It also parallels the note "Can LLM explanations actually help humans predict model behavior?": in both cases, what looks like quality evaluation is actually decoupled from the semantically relevant signal.

Original note title

reward models ignore prompt context when evaluating responses — decomposing into prompt-free and prompt-related components reveals and corrects the generalization failure