Can decomposing rewards into prompt-free and prompt-related components fix this blindspot?
This explores whether splitting a reward model's score into a 'prompt-free' part (how good the response looks on its own) and a 'prompt-related' part (how well it actually answers what was asked) can cure reward models' habit of rewarding fluent-but-irrelevant answers.
The corpus says: yes, and the decomposition is also the diagnostic that exposes the problem. Standard reward models quietly learn response-level biases: they reward writing that is smooth, confident, and well-formatted regardless of whether it is on-topic. That failure stays hidden inside a holistic score and becomes visible once you factor the score into what depends on the prompt and what doesn't. With the two separated, you can target the fix directly instead of hoping a holistic score happens to weigh relevance (Do reward models actually consider what the prompt asks?).
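One hedged way to make the split concrete is to treat the prompt-free part as the score a response still earns when paired with unrelated prompts, and the prompt-related part as whatever the real prompt adds on top. This is an illustrative sketch, not the method from the cited note: `decompose_reward`, `toy_reward`, and the idea of averaging over `reference_prompts` are all assumptions introduced here.

```python
from statistics import mean
from typing import Callable, Sequence

RewardFn = Callable[[str, str], float]  # (prompt, response) -> scalar score


def decompose_reward(reward: RewardFn, prompt: str, response: str,
                     reference_prompts: Sequence[str]) -> tuple[float, float]:
    """Split reward(prompt, response) into (prompt_free, prompt_related)."""
    # Prompt-free (assumed estimator): the score the response still earns
    # when it is paired with unrelated prompts.
    prompt_free = mean(reward(p, response) for p in reference_prompts)
    # Prompt-related: what the actual prompt adds on top of that baseline.
    prompt_related = reward(prompt, response) - prompt_free
    return prompt_free, prompt_related


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical stand-in reward: a length-based polish bonus plus word overlap."""
    polish = min(len(response.split()), 10) / 10
    overlap = len(set(prompt.lower().split()) & set(response.lower().split()))
    return polish + overlap
```

Under this toy reward, a long but off-topic response gets a high prompt-free score and a prompt-related score near zero, which is the pattern a holistic score would hide.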
What makes this more than a one-paper trick is that decomposition keeps showing up across the corpus as the general remedy for reward signals that collapse too much into a single number. Breaking instruction quality into a checklist of verifiable sub-criteria reduces the same overfitting to 'superficial artifacts' that plagues holistic reward models (Can breaking down instructions into checklists improve AI reward signals?). Treating rubrics as gates that accept or reject whole rollouts, rather than folding rubric scores into a dense reward, preserves their categorical strength and blocks reward hacking (Can rubrics and dense rewards work together without hacking?). And feedback itself turns out to carry two orthogonal channels, evaluative ('how good was this?') and directive ('how should it change?'), of which a scalar reward keeps only the first (Can scalar rewards capture all the information in agent feedback?). The recurring lesson: a scalar is a lossy container, and naming the components you've been averaging together is usually where the fix begins.
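The gating idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the cited method: `gate_then_score`, the `Rubric` type, and the rule that a rollout must pass every rubric are hypothetical, and the sketch gates individual rollouts where the underlying note describes gating rollout groups.

```python
from typing import Callable, Optional, Sequence

Rubric = Callable[[str], bool]  # hypothetical categorical check on one rollout


def gate_then_score(rollouts: Sequence[str], rubrics: Sequence[Rubric],
                    dense_reward: Callable[[str], float]) -> list[Optional[float]]:
    """Apply rubrics as an accept/reject gate, then score only accepted rollouts."""
    scored: list[Optional[float]] = []
    for rollout in rollouts:
        if all(rubric(rollout) for rubric in rubrics):
            # Accepted: the dense reward only ranks answers that are already valid.
            scored.append(dense_reward(rollout))
        else:
            # Rejected: no dense score is computed, so fluency cannot buy back a failed rubric.
            scored.append(None)
    return scored
```

The design point is that rubric outcomes are never summed into the scalar. A rejected rollout has no score to inflate, so the dense reward cannot trade polish against a failed criterion.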
There's a second, complementary route to the same blindspot that doesn't touch the reward at all: fix the model's sensitivity to prompts directly. Consistency training teaches a model to respond identically to a clean prompt and a 'wrapped' or perturbed version of it, using the model's own clean answers as the target, so it learns to ignore irrelevant surface changes while staying anchored to intent (Can models learn to ignore irrelevant prompt changes?). Read alongside the reward-decomposition work, this suggests two sides of the same problem: you can teach the judge to stop ignoring the prompt, or teach the responder to stop being swayed by prompt noise.
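The data-construction step behind output-level consistency training can be sketched as follows. This is a rough illustration only: `build_consistency_pairs`, `wrap`, and `generate` are hypothetical names, the fine-tuning step itself is omitted, and the activation-level variant is not shown.

```python
from typing import Callable, Sequence


def build_consistency_pairs(clean_prompts: Sequence[str],
                            wrap: Callable[[str], str],
                            generate: Callable[[str], str]) -> list[tuple[str, str]]:
    """Pair each wrapped prompt with the model's own answer to the clean prompt."""
    pairs = []
    for prompt in clean_prompts:
        # Target comes from the model itself on the clean prompt,
        # not from an external label set.
        target = generate(prompt)
        # Training input is the perturbed version of the same prompt.
        pairs.append((wrap(prompt), target))
    return pairs
```

Fine-tuning on these pairs would push the model to give its clean-prompt answer even when the prompt arrives wrapped in irrelevant or misleading text.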
Where this gets genuinely interesting is the question of why a single number was ever expected to carry prompt-relevance in the first place. When numerical rewards plateau, it's because they encode that a failure happened but not why or how to recover; natural-language critiques break those plateaus precisely by restoring the information a scalar threw away (Can natural language feedback overcome numerical reward plateaus?). And reward models score better when they're allowed to reason before judging rather than emitting a snap scalar (Can reward models benefit from reasoning before scoring?). So 'prompt-free vs. prompt-related' isn't just a clever fix for one bug. It's one instance of a larger pattern the corpus keeps circling: the moment you stop compressing evaluation into a single opaque number, the failures you couldn't name become things you can measure, gate, and correct.
Sources (7 notes)
Standard reward models learn response-level biases instead of prompt-response alignment, causing them to reward responses that are well-written but irrelevant. Decomposing reward into prompt-free and prompt-related components reveals this failure and enables targeted fixes.
RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.