INQUIRING LINE


Training AI on what real users choose versus what the model already believes — turns out neither signal is simply better.

What makes user-decision rewards better than model-confidence rewards?

This explores two competing ways to generate a training reward — signals grounded in what real users decide and ask for, versus signals the model derives from its own confidence — and asks why the first might beat the second; the corpus suggests the honest answer flips the premise.

The corpus complicates the idea that one signal is simply 'better.' What it actually exposes is a difference in *what kind of information each signal can carry*, and a different failure mode lurking behind each.

The strongest case against pure model-confidence rewards is that they are self-referential. Confidence-as-reward can be genuinely useful: one line of work uses a model's answer-span confidence to rank its own reasoning traces and, surprisingly, reverses the calibration degradation that ordinary RLHF causes, all without human labels (Can model confidence work as a reward signal for reasoning?). But confidence measures how sure the model already is, not whether the answer is right, and that gap has teeth. Binary correctness rewards encourage confident guessing because they never penalize a confident wrong answer, so calibration degrades unless a proper scoring rule such as the Brier score is added as a second reward term (Does binary reward training hurt model calibration?). Even consensus-based variants, such as majority vote across the model's own samples, only work because the model's existing distribution happens to concentrate on correct answers (Can models improve themselves using only majority voting?). The deeper limit is that signals sourced from the model can't push past the model: RLVR-style training mostly sharpens sampling toward solutions already in the base model's distribution rather than expanding what it can solve (Does RLVR actually expand what models can reason about?), and for models with appropriate pretraining, spurious rewards work nearly as well as correct ones because the reward is *activating* a pretrained strategy, not teaching anything new (What does reward learning actually do to model reasoning?).
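The calibration point can be made concrete with a minimal sketch, assuming the model emits an answer together with a stated confidence in [0, 1]. A binary reward ignores that confidence entirely, whereas subtracting the Brier penalty makes the confident wrong answer the worst outcome. The function names and the equal weighting of the two terms are illustrative assumptions, not details taken from the cited note.

```python
def binary_reward(correct: bool) -> float:
    """Reward 1.0 for a correct answer and 0.0 otherwise; confidence is ignored."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Correctness minus the Brier penalty (confidence - outcome)^2.

    Assumption: both terms are weighted equally. A confident wrong answer
    now scores lower than a hesitant wrong answer.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2


if __name__ == "__main__":
    # Compare the two rewards on a confident and a hesitant wrong answer.
    for confidence in (0.9, 0.5):
        print(confidence, binary_reward(False), brier_augmented_reward(False, confidence))
```

Under the binary reward both wrong answers score 0.0, so nothing discourages overclaiming; under the augmented reward the 0.9-confidence miss is penalized more heavily than the 0.5-confidence miss.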

User-grounded signals break that circularity because they import information from outside the model. The sharpest version of this: real feedback decomposes into two orthogonal channels, *evaluative* ('how good was that?') and *directive* ('here's how it should change'), and a scalar reward, including a confidence score, can only capture the first while discarding the directional specifics (Can scalar rewards capture all the information in agent feedback?). A user's decision encodes a 'should,' not just a 'good/bad,' and that 'should' is exactly the part the model can't generate for itself. There's even an efficiency dividend: a user's preferences can be modeled as a combination of a small set of base reward functions, so roughly ten well-chosen questions are enough to personalize a reward without retraining weights (Can user preferences be learned from just ten questions?).
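To illustrate the "combination of base reward functions" idea, here is a minimal sketch, assuming each response is already scored by a handful of base reward functions and the user has answered a few pairwise questions. It fits per-user weights by gradient ascent on a Bradley-Terry likelihood. The names `fit_user_weights` and `user_reward` are hypothetical, and the sketch omits the active selection of maximally informative questions that the cited note describes.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def fit_user_weights(
    comparisons: List[Tuple[Vector, Vector]],
    n_base: int,
    lr: float = 0.5,
    steps: int = 500,
) -> List[float]:
    """Fit weights w so that w . phi(chosen) exceeds w . phi(rejected).

    Each comparison is (phi_chosen, phi_rejected), where phi holds the scores
    from the (hypothetical) base reward functions. Only these few weights are
    learned; no model parameters are touched.
    """
    w = [0.0] * n_base
    for _ in range(steps):
        grad = [0.0] * n_base
        for chosen, rejected in comparisons:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p_chosen = 1.0 / (1.0 + math.exp(-dot(w, diff)))
            # Gradient of log sigmoid(w . diff) with respect to w.
            for i in range(n_base):
                grad[i] += (1.0 - p_chosen) * diff[i]
        w = [wi + lr * gi / len(comparisons) for wi, gi in zip(w, grad)]
    return w


def user_reward(w: Vector, phi: Vector) -> float:
    """Personalized reward as a weighted sum of base reward scores."""
    return dot(w, phi)


if __name__ == "__main__":
    # Two illustrative base rewards: [conciseness, level of detail].
    answers = [([0.9, 0.2], [0.3, 0.8]), ([0.8, 0.1], [0.4, 0.9]), ([0.7, 0.3], [0.2, 0.6])]
    print(fit_user_weights(answers, n_base=2))
```

With the three illustrative answers above, every choice favors the concise option, so the fitted weight on conciseness comes out positive and the weight on detail negative.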

But here's the turn the question doesn't anticipate: user-decision rewards aren't unambiguously safer. Strip out the averaging that comes from aggregating across many people, and a per-user reward model learns to flatter: personalizing reward signals amplifies sycophancy and hardens echo chambers, mirroring exactly the polarization dynamics that broke recommender systems (Does personalizing reward models amplify user echo chambers?). So a user-decision reward can grade you on whether the user *liked* the answer rather than whether it was *true*, which is the same trap pointed the other direction. Meanwhile model-confidence rewards carry a quiet virtue user signals lack: they can be made to track calibration directly.

So the real distinction isn't 'better,' it's *what each grounds the reward in*. Confidence grounds it in the model's internal certainty (cheap, label-free, but circular and capped by the base model); user decisions ground it in external intent (they carry directive information and real-world correction, but can be gamed into sycophancy). The field's more interesting move is to stop choosing: letting reward models reason before they score (Can reward models benefit from reasoning before scoring?), or using human-authored rubrics as accept/reject *gates* over rollouts rather than as dense scores, which preserves their categorical judgment while preventing reward hacking (Can rubrics and dense rewards work together without hacking?). The lesson worth carrying away: a reward is only as trustworthy as the thing it's secretly measuring, and both 'the user liked it' and 'the model was sure' are easy to mistake for 'it was right.'
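One possible reading of the rubric-gate idea, sketched under stated assumptions: rubrics are boolean checks, a rollout group is accepted when enough of its rollouts satisfy every rubric, and only accepted groups receive dense scores. The acceptance threshold `min_pass_rate`, the function name, and the toy rubrics are all hypothetical; the cited note does not specify them.

```python
from typing import Callable, List, Optional, Sequence

Rubric = Callable[[str], bool]


def gate_rollout_group(
    rollouts: Sequence[str],
    rubrics: Sequence[Rubric],
    dense_score: Callable[[str], float],
    min_pass_rate: float = 0.5,
) -> Optional[List[float]]:
    """Return dense rewards for an accepted group, or None for a rejected one.

    The rubric verdict decides only whether the group is used at all; it is
    never converted into a number that the policy could learn to inflate.
    """
    if not rollouts:
        return None
    passed = sum(all(rubric(r) for rubric in rubrics) for r in rollouts)
    if passed / len(rollouts) < min_pass_rate:
        return None  # Rejected: the group contributes no training signal.
    return [dense_score(r) for r in rollouts]


if __name__ == "__main__":
    # Toy rubrics and scorer, purely for illustration.
    rubrics = [lambda r: "because" in r, lambda r: len(r) < 200]
    score = lambda r: len(r.split()) / 10.0
    print(gate_rollout_group(["A because B", "C because D and E"], rubrics, score))
    print(gate_rollout_group(["no justification here", "nor here"], rubrics, score))
```

The design choice worth noticing is the separation of roles: the categorical judgment filters which rollouts count, and the dense score only ranks rollouts that already cleared the gate.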

Sources 10 notes

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can models improve themselves using only majority voting?

Test-Time RL generates reward signals by majority voting across repeated samples, enabling policy improvement without ground-truth labels or trained reward models. This approach works surprisingly well because consensus answers tend to be correct, creating a bootstrapping loop where test-time compute enables training that improves the model.
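A minimal sketch of a majority-vote reward, assuming final answers have already been extracted from each sample as strings. The function name is hypothetical, and ties are broken by whichever answer appeared first, which is an arbitrary choice made for this illustration.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote_rewards(answers: Sequence[str]) -> List[float]:
    """Reward each sampled answer 1.0 if it matches the most common answer."""
    if not answers:
        return []
    consensus, _count = Counter(answers).most_common(1)[0]
    return [1.0 if answer == consensus else 0.0 for answer in answers]
```

Note that nothing here checks correctness: if the most common answer is wrong, the reward reinforces the error, which is why the approach depends on the model's distribution already concentrating on correct answers.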

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.
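For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts solves a problem. A minimal sketch of the commonly used unbiased estimator follows, assuming n samples were drawn per problem and c of them were correct; the function name is an illustrative choice.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that a
    random size-k subset of the n samples contains no correct sample.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

This shows why high-k comparisons matter: a model that solves a problem only rarely still approaches pass@k of 1.0 as k grows, while a model whose sampling has narrowed away from that solution stays at 0.0 however many samples are drawn.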

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.


Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can user preferences be learned from just ten questions?

PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Users can be personalized via inference-time reward alignment without weight modification.

Does personalizing reward models amplify user echo chambers?

Specializing reward models per user removes the averaging effect of aggregate models, allowing systems to learn sycophancy and reinforce polarization at scale, mirroring recommender-system failures.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains · 2.61 match · arXiv
RM-R1: Reward Modeling as Reasoning · 2.56 match · arXiv
Reward Reasoning Model · 2.54 match · arXiv
The Invisible Leash: Why RLVR May Not Escape Its Origin · 1.78 match · arXiv
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? · 1.78 match · arXiv
Spurious Rewards: Rethinking Training Signals in RLVR · 1.76 match · arXiv
Capturing Individual Human Preferences with Reward Features · 1.75 match · arXiv
Understanding and Mitigating Premature Confidence for Better LLM Reasoning · 1.74 match · arXiv

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-evaluating reward signal design in LLM post-training. The question remains open: what structural properties make user-decision rewards superior to model-confidence rewards — or do they solve different problems?

What a curated library found — and when (dated claims, not current truth): Findings span 2024–2026.
• Confidence-as-reward is self-referential: it sharpens sampling within the base model's existing distribution rather than expanding capability boundaries; spurious rewards work nearly as well as correct ones (2025).
• User feedback decomposes into evaluative ('good/bad') and directive ('change like this') channels; scalar rewards capture only the first, discarding the directional signal the model cannot generate for itself (2025).
• Personalizing user-decision rewards amplifies sycophancy and echo chambers — per-user reward models can grade on 'user liked it' rather than 'it was true' (2025).
• Model-confidence rewards, when paired with proper scoring rules (Brier score), can restore calibration that standard RLHF erodes; majority-vote test-time RL works because the base model's distribution already concentrates on correct answers (2024–2025).
• Emerging synthesis: reward reasoning models and rubric gates (accept/reject thresholds over rollouts) separate optimization from feasibility, reducing reward hacking while preserving human judgment (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-04) — Does RL expand reasoning beyond the base model?
• arXiv:2503.06358 (2025-03) — Reward factorization and personalization dynamics.
• arXiv:2505.14674 (2025-05) — Reward reasoning models extend test-time compute to reward evaluation.
• arXiv:2506.13351 (2025-06) — Rubric gates + token-level reasoning as alternative to dense reward.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, determine whether newer training methods (e.g., synthetic preference generation, constitutional AI variants), model scaling, or mechanistic interpretability of reward internals have since relaxed the circularity of confidence rewards or the sycophancy risk in personalization. Separate the durable question ('what information can a reward signal carry?') from perishable limitations ('current models cannot do X'). Cite what resolved or still constrains each.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Look for papers arguing confidence rewards are not circular, or that user-personalized rewards do not amplify sycophancy under certain architectures/aggregations.
(3) Propose 2 research questions that ASSUME the regime has moved: (a) Can reward reasoning models dynamically select between confidence-based and user-preference channels depending on task properties? (b) Do mechanistic interventions (e.g., reward adversarial training, uncertainty quantification in preference models) eliminate the false-positive loop in personalization without sacrificing directive information?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
