INQUIRING LINE


Training an AI on 'right answers' versus 'preferred answers' seems straightforward — until you ask what each reward actually certifies.

What distinguishes verifiable rewards from preference-based rewards in unified training?

This line explores the boundary between rewards you can *check* (math is right, code compiles) and rewards you can only *prefer* (a judge or human finds this answer better written or more faithful to instructions), and what the corpus says about folding both into one training signal. The short version: the two differ less in mechanism than in what they're allowed to certify, and the interesting work is happening at the seam where one is converted into the other.

Start with what verifiable rewards actually do. A recurring and slightly deflating finding is that reinforcement learning from verifiable rewards (RLVR) doesn't teach models new reasoning; it surfaces strategies already latent in pretraining. Pass@k analysis shows base models beating RLVR models at high sampling budgets (see "Does RLVR actually expand what models can reason about?"), and the activation framing is echoed across multiple notes: a single training example can suffice, and even spurious rewards work nearly as well as correct ones for well-pretrained models (see "What does reward learning actually do to model reasoning?" and "How does RL training reshape reasoning and what gets lost?"). So 'verifiable' buys you a sharp, hack-resistant signal, but a narrow one. It catalyzes; it doesn't expand.
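The pass@k comparison above can be made concrete with a small sketch. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem counts are invented for illustration and are not taken from any of the cited notes; `mean_pass_at_k`, `base`, and `rlvr` are hypothetical names.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n samples, c of which are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Invented counts of correct samples out of n = 100, one entry per problem.
base = [5, 3, 2, 1]    # solves every problem, but rarely
rlvr = [90, 80, 0, 0]  # solves two problems reliably, two never

for k in (1, 100):
    print(k, mean_pass_at_k(base, 100, k), mean_pass_at_k(rlvr, 100, k))
```

With these made-up counts the RLVR-style profile wins at k = 1 and the base-style profile wins at k = 100, which is the shape of the crossover the notes describe: sharper sampling on a narrower set of solvable problems.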

Preference-based rewards have the opposite profile: broad coverage of subjective quality, but soft and gameable. The corpus catalogs damage on both sides of the line. On the verifiable side, binary correctness rewards quietly degrade calibration because they never punish confident wrong answers, a flaw fixed by adding a proper scoring rule such as the Brier score as a second reward term (see "Does binary reward training hurt model calibration?"). On the preference side, holistic preference models overfit to superficial artifacts, which is why instruction-following gets decomposed into verifiable checklist sub-criteria (see "Can breaking down instructions into checklists improve AI reward signals?"). The most useful way to read these is as a spectrum, not a binary: 'unified training' is really the project of converting fuzzy preferences into checkable units without losing what made them broad.
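A minimal sketch of the calibration fix mentioned above, assuming the model states a confidence in [0, 1] alongside its answer and that the second term is the squared gap between that confidence and the 0/1 outcome. The exact form used in the cited note is not given here, so `calibrated_reward` and `expected_reward` are hypothetical names for one plausible reading.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness minus a Brier penalty (confidence - outcome)^2."""
    outcome = 1.0 if correct else 0.0
    return outcome - (confidence - outcome) ** 2

def expected_reward(p_true: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_true."""
    return (p_true * calibrated_reward(True, confidence)
            + (1 - p_true) * calibrated_reward(False, confidence))

# A confident wrong answer now costs more than a hesitant wrong answer.
print(calibrated_reward(False, 0.9), calibrated_reward(False, 0.1))

# Grid search: which stated confidence maximizes expected reward at p_true = 0.7?
grid = [i / 100 for i in range(101)]
best_confidence = max(grid, key=lambda q: expected_reward(0.7, q))
print(best_confidence)
```

Under this assumed form the expected reward simplifies to 2pq - q^2, which peaks at q = p. Reporting the true success probability is the best policy, while a plain binary reward is indifferent to the stated confidence.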

The sharpest distinction the corpus draws is structural: how you *combine* the two matters more than which you use. One note shows that rubrics work best as **gates** that accept or reject whole rollout groups, while dense token-level rewards optimize *within* the survivors. Converting rubric scores directly into dense rewards invites hacking, but separating feasibility (preference-like) from optimization (verifiable-like) preserves both (see "Can rubrics and dense rewards work together without hacking?"). A related insight: a single scalar can't carry everything. Agent feedback decomposes into *evaluative* signal (how good, which rewards capture) and *directive* signal (how to change, which they discard), making the two complementary rather than substitutable (see "Can scalar rewards capture all the information in agent feedback?"). Ternary rewards make the same move by splitting one axis into three (correct, hallucinated, abstained), so abstention becomes learnable instead of collapsed into 'wrong' (see "Can three-way rewards fix the accuracy versus abstention problem?").
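Two of the structural moves above can be sketched in a few lines. Everything here is illustrative: the `Rollout` fields, the mean-score gating rule, the 0.5 threshold, and the 0.0 abstention reward are assumptions, not details confirmed by the cited notes (which only say that rubrics accept or reject whole groups and that abstention earns an intermediate reward).

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Assumed values: the note fixes +1 and -1; 0.0 for abstention is a guess.
TERNARY = {"correct": 1.0, "abstain": 0.0, "hallucinated": -1.0}
BINARY = {"correct": 1.0, "abstain": 0.0, "hallucinated": 0.0}

def expected_answer_reward(p_correct: float, table: Dict[str, float]) -> float:
    """Expected reward for attempting an answer that is right with p_correct."""
    return p_correct * table["correct"] + (1 - p_correct) * table["hallucinated"]

@dataclass
class Rollout:
    rubric_score: float         # judge score in [0, 1] (hypothetical field)
    token_rewards: List[float]  # dense per-token rewards (hypothetical field)

def gated_advantages(groups: List[List[Rollout]],
                     threshold: float = 0.5) -> List[List[float]]:
    """Rubric decides which groups survive; dense rewards alone set advantages."""
    survivors = []
    for group in groups:
        # Gate: accept or reject the whole group (mean-score rule is assumed).
        if mean(r.rubric_score for r in group) < threshold:
            continue
        # Optimize within survivors: mean-centered sums of dense rewards.
        returns = [sum(r.token_rewards) for r in group]
        baseline = mean(returns)
        survivors.append([ret - baseline for ret in returns])
    return survivors

groups = [
    [Rollout(0.9, [0.1, 0.3]), Rollout(0.7, [0.0, 0.2])],  # passes the gate
    [Rollout(0.2, [1.0, 1.0]), Rollout(0.1, [0.9, 0.9])],  # rejected outright
]
advantages = gated_advantages(groups)
```

The rubric score never scales an advantage, so inflating it past the threshold buys nothing: the second group has the largest dense rewards but contributes no gradient signal. The ternary table shows the abstention effect in the same spirit: at 30% accuracy, attempting an answer has negative expected reward under `TERNARY` but positive expected reward under `BINARY`, so only the ternary scheme makes abstaining the better choice.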

The less obvious finding: the verifiable/preference boundary may be dissolving from both ends. Reward models are growing reasoning traces and test-time compute, behaving less like fixed verifiers and more like deliberating judges (see "Can reward models benefit from reasoning before scoring?"). Meanwhile, a model's *own* confidence can manufacture synthetic preferences that improve reasoning and restore calibration with no human labels or external verifier at all (see "Can model confidence work as a reward signal for reasoning?"). The late-2025 literature converges on three patterns in which the policy's internal computations replace the reward model, the critic, and the explicit reward signal (see "Can language models replace reward models with internal signals?"); even belief-shift toward a solution becomes its own dense intrinsic reward (see "Can an agent's own beliefs guide credit assignment without critics?"). The unified picture isn't 'verifiable plus preference.' It's a continuum where the model increasingly generates both kinds of signal from inside itself.
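Two of these internal signals are simple enough to sketch. The notes describe ranking reasoning traces by answer-span confidence and assigning per-turn credit from log-ratios of sequential probability estimates; the concrete choices below (geometric-mean token probability as "confidence", a fixed margin for forming pairs, and all function names) are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import List, Tuple

def answer_confidence(answer_token_logprobs: List[float]) -> float:
    """Geometric-mean token probability over the answer span (assumed definition)."""
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

def synthetic_preferences(traces: List[Tuple[str, List[float]]],
                          margin: float = 0.1) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs from confidence gaps of at least `margin`."""
    scored = [(trace_id, answer_confidence(lps)) for trace_id, lps in traces]
    pairs = []
    for (a, conf_a), (b, conf_b) in combinations(scored, 2):
        if abs(conf_a - conf_b) >= margin:
            pairs.append((a, b) if conf_a > conf_b else (b, a))
    return pairs

def belief_shift_rewards(beliefs: List[float]) -> List[float]:
    """Per-turn reward: log-ratio of successive probabilities of the solution."""
    return [math.log(curr / prev) for prev, curr in zip(beliefs, beliefs[1:])]

traces = [("t1", [-0.1, -0.2]), ("t2", [-1.0, -1.5]), ("t3", [-0.12, -0.2])]
pairs = synthetic_preferences(traces)

beliefs = [0.1, 0.2, 0.8]  # invented belief trajectory over two turns
rewards = belief_shift_rewards(beliefs)
```

The per-turn log-ratios telescope: their sum equals the log of the final belief over the initial belief, so the dense signal redistributes the overall belief gain across turns without a critic. Neither function consults a human label or an external verifier, which is the point the paragraph above makes.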

Sources 12 notes

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

How does RL training reshape reasoning and what gets lost?

Research shows that verifiable rewards act as catalysts that surface existing capabilities from pretraining, not teachers that build new reasoning. RL updates are structurally sparse and bounded by the pretrained prior, not algorithmic sophistication.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can breaking down instructions into checklists improve AI reward signals?

RLCF and RaR methods decompose instruction quality into verifiable sub-criteria, improving performance on benchmarks like FollowBench and HealthBench. This decomposition principle reduces overfitting to superficial artifacts that plague holistic reward models.


Can rubrics and dense rewards work together without hacking?

DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (4.31 match, arXiv)
Reward Reasoning Model (4.23 match, arXiv)
The Invisible Leash: Why RLVR May Not Escape Its Origin (3.50 match, arXiv)
Spurious Rewards: Rethinking Training Signals in RLVR (3.45 match, arXiv)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2.64 match, arXiv)
Learning to Reason without External Rewards (2.58 match, arXiv)
RM-R1: Reward Modeling as Reasoning (2.56 match, arXiv)
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey (2.56 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher evaluating whether the verifiable/preference-reward distinction remains architecturally real or has begun to dissolve in practice. The question: what separates rewards you can *check* (math correct, code compiles) from rewards you can only *prefer* (this reads better)—and does unified training erase or sharpen that line?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–early 2026.
• RLVR surfaces latent reasoning but doesn't expand it beyond pretraining; base models outperform RLVR models at high pass@k (2025).
• Binary correctness rewards degrade calibration; adding a Brier-score term fixes it; ternary rewards (correct/hallucinated/abstain) make abstention learnable and reduce hallucination (2025).
• Rubric gates (fuzzy preference judgments) work best as binary accept/reject filters; dense token-level rewards applied post-filter preserve both signal types; direct rubric-to-dense conversion invites gaming (2025–2026).
• Agent feedback decomposes into evaluative (how good) and directive (how to improve) components; scalar rewards collapse these, making them complementary not substitutable (2025).
• Verifier-free RL converges on three patterns: models generate their own dense rewards via pairwise self-judgment, belief-shift toward solutions, and confidence-driven intrinsic rewards—no external reward model needed (2025–2026).

Anchor papers (verify; mind their dates):
• arXiv:2504.13837 (2025-05): Does RLVR expand reasoning beyond base?
• arXiv:2506.13351 (2025-06): Token-level reasoning + rubric gates
• arXiv:2505.19590 (2025-05): Learning to reason without external rewards
• arXiv:2603.10165 (2026-03): OpenClaw-RL agent feedback decomposition

Your task:
(1) RE-TEST THE BOUNDARY. For each claim above—especially that RLVR surfaces but doesn't expand reasoning, and that rubric gates work as filters—judge whether post-2026 models, training methods (e.g., multi-objective RL, curriculum design), test-time scaling, or evals have moved the constraint. Isolate what is durable (the verifiable/preference distinction *conceptually*) from what is perishable (the claim that RLVR *cannot* expand reasoning, or that dense rewards *must* be gated). Say plainly whether the line still holds or has softened.
(2) Surface the strongest *contradiction* or *supersession* from the last 6 months. If you find a paper claiming RLVR does expand reasoning, or that dense verifiable rewards work without gates, flag it hard.
(3) Propose 2 research questions that assume the regime *has* moved: (a) If models now generate verifiable signals intrinsically, what role remains for human-provided verifiers? (b) If preference and verifiable rewards have begun to merge at the model-level, can we identify architectural or algorithmic markers of that fusion?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
