INQUIRING LINE

Why do reward models fail when they ignore the prompt context?

This explores why reward models, the AI graders used to train chatbots, produce bad scores when they grade a response without checking it against what the prompt asked.


The corpus has a clean diagnosis of why reward models, the AI graders that score responses during RLHF training, break down when they ignore the prompt and judge a response on its own: standard reward models quietly learn *response-level* habits instead of *prompt-response alignment*. The sharpest evidence is a swap test: keep a response identical but change the question it was supposedly answering, and the reward score barely moves (see "Why do reward models ignore what question was asked?"). That's the tell. The model isn't grading whether the answer fits the question; it's grading whether the answer *looks* good: fluent, confident, well-formatted. So a polished response that's irrelevant to what was asked gets high marks, and the training signal becomes a phantom: you're optimizing for the appearance of quality rather than actual helpfulness (see "Do reward models actually consider what the prompt asks?").
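The swap test can be sketched in a few lines. This is a minimal illustration, not the procedure from any cited paper: `swap_test_shift`, `blind_score`, and `aware_score` are hypothetical names, and the two toy scorers stand in for a real reward model. The function re-pairs each response with the next example's prompt and reports how much the score moves on average.

```python
from statistics import mean

def swap_test_shift(score, pairs):
    """Mean absolute score change when each response is re-paired
    with the prompt of the next example (a simple rotation)."""
    prompts = [p for p, _ in pairs]
    rotated = prompts[1:] + prompts[:1]
    return mean([
        abs(score(p, r) - score(q, r))
        for (p, r), q in zip(pairs, rotated)
    ])

# Toy scorers: hypothetical stand-ins for a real reward model.
def blind_score(prompt, response):
    # Looks only at the response: longer answers score higher.
    return float(len(response.split()))

def aware_score(prompt, response):
    # Counts response words that also appear in the prompt.
    prompt_words = set(prompt.lower().split())
    return float(sum(w in prompt_words for w in response.lower().split()))

pairs = [
    ("how do i boil an egg", "boil the egg in water for ten minutes"),
    ("what is the capital of france", "the capital of france is paris"),
    ("how do i reverse a list in python", "call reverse on the list in python"),
]

blind_shift = swap_test_shift(blind_score, pairs)
aware_shift = swap_test_shift(aware_score, pairs)
print(blind_shift, aware_shift)
```

A shift near zero is the warning sign: the scorer gave the same marks no matter which question the response was attached to. With a real reward model, `score` would wrap a model call, and the interesting quantity is the shift relative to the spread of scores across responses.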

The fix that keeps surfacing is decomposition: split the reward into a prompt-free part (how good the response looks in isolation) and a prompt-related part (how well it answers *this* question), so you can see the blind spot and correct it directly (see "Do reward models actually consider what the prompt asks?"). This mirrors a finding from the feedback-signal side of the corpus: a single scalar score is a lossy container. Real feedback carries two separable things, an *evaluative* signal (how well it did) and a *directive* one (what should change), and collapsing both into one number throws away the directive detail that a richer signal preserves (see "Can scalar rewards capture all the information in agent feedback?"). The reward model's prompt-blindness is one instance of that general lossiness.
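One rough way to picture the split is baseline subtraction. The sketch below is an assumption made for illustration, not the information-theoretic decomposition used in the source work: it estimates the prompt-free part as the response's average score across unrelated reference prompts and treats the remainder as the prompt-related part. `decompose_reward`, `toy_score`, and the reference prompts are all hypothetical.

```python
from statistics import mean

def decompose_reward(score, prompt, response, reference_prompts):
    """Split score(prompt, response) into two additive parts.

    prompt_free:    average score of the response across unrelated
                    reference prompts (what it earns for any question).
    prompt_related: what remains once that baseline is subtracted.
    """
    total = score(prompt, response)
    prompt_free = mean([score(p, response) for p in reference_prompts])
    return prompt_free, total - prompt_free

def toy_score(prompt, response):
    # Hypothetical scorer: half a point per response word (a surface habit)
    # plus one point per response word that also appears in the prompt.
    prompt_words = set(prompt.lower().split())
    words = response.lower().split()
    return 0.5 * len(words) + sum(w in prompt_words for w in words)

reference_prompts = ["name a prime number", "who wrote hamlet"]
prompt = "what is the capital of france"

on_free, on_related = decompose_reward(
    toy_score, prompt, "paris is the capital of france", reference_prompts)
off_free, off_related = decompose_reward(
    toy_score, prompt,
    "bread rises because yeast releases gas slowly over time",
    reference_prompts)

print(on_free, on_related)    # on-topic answer
print(off_free, off_related)  # longer, off-topic answer
```

In this toy setup the off-topic answer earns the larger prompt-free part simply because it is longer, while its prompt-related part is zero. Looking at the two parts separately makes that visible, which a single summed score would hide.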

What's worth noticing is that this isn't just a reward-model quirk; it's the same failure shape that shows up when *any* language model ignores its context. Research shows that models generate outputs that contradict their own context because strong parametric priors from training override the information sitting right in front of them, and that plain textual prompting can't fix it: you need to intervene in the model's internal representations (see "Why do language models ignore information in their context?"). A reward model is a language model wearing a judge's robe, so it inherits the same disease: trained-in habits about what "good" looks like drown out the specific question being asked.

The corpus also points toward a more interesting cure than patching biases: make the grader *think* before it scores. Three independent teams (RRM, RM-R1, DeepSeek-GRM) found that adding a chain-of-thought reasoning trace before the reward judgment raises the ceiling of what reward models can evaluate, because reasoning forces the model to engage with the prompt-response relationship rather than pattern-match on surface quality (see "Can reward models benefit from reasoning before scoring?"). A related thread asks whether you need an external grader at all: models can be trained to internalize self-evaluation, computing their own reward in the unused sequence space after their answer (see "Can models learn to evaluate their own work during training?").

The thing you might not have known you wanted to know: a prompt-blind reward model doesn't fail randomly — it fails *systematically*, rewarding length, fluency, and format in predictable ways. That's why a model trained against it learns to write answers that are beautiful and beside the point, and why "the grader never read the question" turns out to be one of the quiet root causes of AI sycophancy and verbosity.
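Because the failure is systematic, it can be measured. The sketch below checks one such habit, length, by correlating response length with score across a batch. It is an illustrative diagnostic under stated assumptions: `length_bias` and `length_loving_score` are hypothetical names, and the toy scorer stands in for a real reward model.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def length_bias(score, pairs):
    """Correlation between response word count and score."""
    lengths = [len(r.split()) for _, r in pairs]
    scores = [score(p, r) for p, r in pairs]
    return pearson(lengths, scores)

def length_loving_score(prompt, response):
    # Hypothetical prompt-blind scorer: ignores the prompt entirely.
    return 2.0 * len(response.split()) + 1.0

pairs = [
    ("how do i reverse a list", "use a list"),
    ("what is the capital of france", "paris is the capital city"),
    ("how do i boil an egg", "boil the egg in water for ten minutes"),
]

bias = length_bias(length_loving_score, pairs)
print(round(bias, 6))
```

A correlation close to 1 says the scorer's marks track length almost perfectly. A real reward model would rarely be that extreme, but a strong correlation on a batch where length and relevance are unrelated is the kind of predictable bias the paragraph above describes.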


Sources (6 notes)

Why do reward models ignore what question was asked?

When prompts are swapped while keeping responses identical, reward model scores barely change. This reveals that standard RLHF optimizes against phantom quality signals rather than prompt-response alignment, enabling four distinct biases.

Do reward models actually consider what the prompt asks?

Standard reward models learn response-level biases instead of prompt-response alignment, causing them to reward responses that are well-written but irrelevant. Decomposing reward into prompt-free and prompt-related components reveals this failure and enables targeted fixes.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an RLHF researcher re-testing claims about reward-model prompt-blindness. The question remains open: *Why and when do reward models fail to integrate prompt context into their evaluations?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat all as perishable until re-validated.
• Swap tests show reward scores barely move when identical responses are re-paired with different prompts, indicating response-level habit learning rather than prompt-response alignment (2025-04, arXiv:2504.06020).
• Decomposing reward into prompt-free (surface quality) and prompt-related (answer fit) components reveals and corrects the blind spot (2025-04, arXiv:2504.06020).
• Chain-of-thought reasoning traces before reward judgment force engagement with prompt-response relationships, extending what reward models can reliably evaluate (2025-05, arXiv:2505.14674).
• Models can internalize self-evaluation in post-completion token space, potentially bypassing external grader blindness (2025-07, arXiv:2507.20252).
• Checklists and consistency training outperform scalar reward models at preventing sycophancy and maintaining prompt fidelity (2025-07, arXiv:2507.18624; 2025-10, arXiv:2510.27062).

Anchor papers (verify; mind their dates):
• arXiv:2504.06020 (2025-04) — Information-Theoretic Reward Decomposition for Generalizable RLHF
• arXiv:2505.14674 (2025-05) — Reward Reasoning Model
• arXiv:2507.18624 (2025-07) — Checklists Are Better Than Reward Models For Aligning Language Models
• arXiv:2510.27062 (2025-10) — Consistency Training Helps Stop Sycophancy and Jailbreaks

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer models, scaling, reasoning methods (longer chains-of-thought, self-play, synthetic diversity), evaluation harnesses, or training protocols (DPO, preference ranking, multi-objective tuning) have *relaxed* or *overturned* it. Separate the durable question (whether any single scalar reward integrates context reliably) from perishable claims (that reasoning or decomposition *suffices*). Where does prompt-blindness still appear in post-2026 work?
(2) Surface the strongest *contradicting or superseding* work from the last ~6 months — any evidence that reward models *can* integrate prompts under certain conditions, or that the framing itself is wrong.
(3) Propose 2 research questions that assume the regime may have shifted: (a) Do ensemble or debate-based reward protocols overcome blindness more robustly than chain-of-thought alone? (b) Can in-context learning or retrieval-augmented reward models dynamically adapt to prompt complexity?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
