Is elaborate reward shaping necessary if the pretrained prior already contains good solutions?
This explores a tension in the corpus: when a model's pretraining already contains good solutions, does reward engineering actually matter — or does almost any signal work because RL is just surfacing what's already there?
On whether elaborate reward shaping earns its keep when the pretrained prior already holds good solutions, the corpus leans surprisingly hard toward "no," though not for the reasons you'd think. The strongest version of the argument is that reinforcement learning with verifiable rewards (RLVR) doesn't teach new reasoning at all; it sharpens sampling toward solutions the base model could already produce (see "Does RLVR actually expand what models can reason about?"). If RL is mostly an activation mechanism, then the precise shape of the reward should matter less than whether it points roughly the right way. That prediction holds up remarkably well: a single training example can trigger the gains, and *spurious* rewards work nearly as well as correct ones, provided the pretraining was good enough to have the strategy waiting (see "What does reward learning actually do to model reasoning?"). In that light, elaborate shaping is often dressing on a process that's really just unlocking a latent prior.
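The "sharpening, not expanding" claim is usually measured with pass@k, the probability that at least one of k sampled attempts solves a problem. The sketch below uses the standard unbiased combinatorial estimator for pass@k; the helper names and the per-problem counts are hypothetical, chosen only to illustrate how a tuned model can win at k = 1 and still lose to its base model at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples,
    drawn without replacement from n attempts with c correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong attempts exist, so every k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-attempt counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# HYPOTHETICAL counts out of n = 100 attempts on two problems (not real data).
N_ATTEMPTS = 100
BASE_COUNTS = [20, 2]    # base model: rarely right, but sometimes right on both
TUNED_COUNTS = [90, 0]   # tuned model: sharp on problem 1, never solves problem 2

if __name__ == "__main__":
    for k in (1, 64):
        base = mean_pass_at_k(BASE_COUNTS, N_ATTEMPTS, k)
        tuned = mean_pass_at_k(TUNED_COUNTS, N_ATTEMPTS, k)
        print(f"k={k:>2}  base={base:.3f}  tuned={tuned:.3f}")
```

With these made-up counts the tuned model has the higher pass@1, while the base model has the higher pass@64, because it keeps a small but nonzero success rate on the problem the tuned model never solves.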
A second cluster pushes in the same direction from different angles, suggesting the reward can be radically minimal. Negative reinforcement alone, which only suppresses wrong trajectories and never rewards right ones, often matches full PPO and GRPO, and it preserves the answer diversity that positive-only training collapses (see "Does negative reinforcement alone outperform full reinforcement learning?"). You can also let the model supply its own dense signal: an agent's belief shift toward the solution serves as intrinsic per-turn credit, with no critic network or process-reward model bolted on (see "Can an agent's own beliefs guide credit assignment without critics?"). And on hard problems, sophisticated domain reasoning emerges from plain accuracy signals with no chain-of-thought distillation at all (see "Can simple rewards alone teach complex domain reasoning?"). The common thread: when the prior is strong, the reward's job is to *select*, not to *instruct*, and selection is cheap.
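To make the belief-shift idea concrete, here is a minimal sketch of per-turn credit computed as the log-ratio of consecutive probability estimates for the correct answer. The function name and the example probabilities are hypothetical, and the exact formulation used by ΔBelief-RL may differ; the sketch only assumes what the source note states, namely that credit comes from log-ratios of sequential probability estimates.

```python
import math


def belief_shift_credit(target_probs: list[float]) -> list[float]:
    """Per-turn credit as log(p_t / p_{t-1}), where p_t is the agent's own
    probability estimate for the correct answer after turn t.

    target_probs[0] is the belief before the first turn, so the result has
    one entry per turn. (Hypothetical helper, not an API from the source.)
    """
    if len(target_probs) < 2:
        raise ValueError("need a starting belief plus at least one turn")
    if any(not 0.0 < p <= 1.0 for p in target_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return [
        math.log(curr / prev)
        for prev, curr in zip(target_probs, target_probs[1:])
    ]


if __name__ == "__main__":
    # HYPOTHETICAL belief trajectory over three turns of a guessing game.
    beliefs = [0.05, 0.10, 0.08, 0.40]
    for turn, credit in enumerate(belief_shift_credit(beliefs), start=1):
        print(f"turn {turn}: credit {credit:+.3f}")
```

Turns that move the belief toward the answer get positive credit and turns that move it away get negative credit. Because the log-ratios telescope, the per-turn credits sum to the log-ratio of the final belief to the starting belief, so the dense signal stays consistent with the overall outcome without a learned critic.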
But here's the turn that makes the question interesting: the corpus also marks exactly where the minimal-reward story breaks, and it is precisely where the prior *doesn't* contain what you need. Binary correctness rewards quietly degrade calibration, because nothing penalizes a confident wrong answer; adding a proper scoring rule (the Brier score) as a second reward term fixes it, a case where shaping is mathematically necessary, not decorative (see "Does binary reward training hurt model calibration?"). When models hit performance plateaus, scalar rewards stall because they carry no information about *why* a solution failed; natural-language critique breaks through where more numerical reward cannot (see "Can natural language feedback overcome numerical reward plateaus?"). That's because feedback decomposes into two channels, evaluative (how good) and directive (how to change), and a scalar can only ever carry the first (see "Can scalar rewards capture all the information in agent feedback?").
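The calibration point can be checked with a few lines of arithmetic. The sketch below assumes the combined reward is correctness minus the squared error between stated confidence and correctness (a Brier-style penalty); the function names and the 0.7 success probability are hypothetical, and the exact weighting used in the cited work may differ.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores the stated confidence entirely."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier-style penalty (confidence - outcome)^2."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


def expected_calibrated_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (
        p_correct * calibrated_reward(True, confidence)
        + (1.0 - p_correct) * calibrated_reward(False, confidence)
    )


if __name__ == "__main__":
    # Under the binary reward, a wrong answer scores 0 at any confidence.
    print(binary_reward(False))
    # Under the calibrated reward, overconfidence on a wrong answer costs more.
    print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))

    # HYPOTHETICAL true success rate; search a grid for the best confidence.
    p = 0.7
    grid = [i / 100 for i in range(101)]
    best = max(grid, key=lambda q: expected_calibrated_reward(p, q))
    print(f"best stated confidence for p={p}: {best:.2f}")
```

The grid search lands on a stated confidence equal to the true success probability, which is what makes the Brier term a proper scoring rule: the model maximizes expected reward by reporting its actual chance of being right, while the correctness term still pushes that chance up.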
So the real answer reframes the question. Elaborate shaping isn't necessary for *activating* good solutions already in the prior; there, simple, sparse, even noisy rewards suffice, and complexity is mostly wasted. Shaping becomes necessary the moment you're asking for something the prior lacks: well-calibrated confidence, escape from a capability plateau, or genuinely new reasoning patterns, which the corpus suggests come from distillation, not RL (see "Does RLVR actually expand what models can reason about?"). Even the reward model itself can be made smarter rather than more elaborate: letting it reason before it scores raises its ceiling beyond what outcome-based evaluation achieves (see "Can reward models benefit from reasoning before scoring?").
The thing you didn't know you wanted to know: the question "is shaping necessary?" has no universal answer because reward shaping and the pretrained prior are substitutes. A rich prior lets you get away with a crude reward; a crude reward only fails you in the gaps the prior didn't already fill. The engineering question isn't "how elaborate should my reward be?" — it's "what, specifically, is missing from the prior, and is that gap evaluative or directive?"
Sources (9 notes)
Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.
Medical AI systems and o3 demonstrate that sophisticated domain reasoning emerges naturally from RL training on difficult problems with only basic accuracy signals, without requiring explicit chain-of-thought distillation from teacher models.
Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.