INQUIRING LINE


AI can train itself as long as checking an answer is easier than finding one — but where does that advantage run out?

At what capability level does the generation-verification gap make intrinsic rewards insufficient?

This explores the boundary condition where a model's ability to check its own answers stops outpacing its ability to produce them — the point past which a model training on its own internal reward signal can no longer improve and needs outside help.

This is really a question about a single asymmetry: can a model judge a candidate answer more reliably than it can generate one? When verification is the easier half of that pair (checkable math, code that runs, problems where wrong answers are obvious), intrinsic signals like self-consistency carry a model a long way. The generation-verification gap makes intrinsic rewards insufficient precisely at the frontier where that asymmetry disappears: where the hardest problems a model faces are ones it cannot verify any better than it can solve. The note "Can models reliably improve themselves without external feedback?" makes this the load-bearing point: pure self-improvement stalls there, and every method that actually keeps working quietly smuggles in an external anchor, such as a frozen past version, a third-party judge, a user correction, or a tool that returns ground truth. The 'capability level' in the question isn't a fixed model size; it's wherever a given problem sits relative to that model's own verification ceiling.
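
To make the self-consistency signal concrete, here is a minimal sketch of majority-vote pseudo-labeling used as an intrinsic reward. The function name `self_consistency_rewards` and the sample answers are hypothetical, chosen only for illustration; none of the notes specifies this exact procedure.

```python
from collections import Counter


def self_consistency_rewards(answers: list[str]) -> tuple[str, list[float]]:
    """Return the majority answer and a 0/1 reward per sampled answer.

    Hypothetical helper: a sample is rewarded when it agrees with the
    majority of the model's own samples. No ground truth is consulted.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == majority else 0.0 for a in answers]
    return majority, rewards


# Case 1: the model is usually right, so the majority is a usable label.
label_easy, rewards_easy = self_consistency_rewards(["42", "42", "17", "42"])

# Case 2: the model is consistently wrong (assume the true answer is "42").
# The signal still rewards the majority, so it reinforces the error.
label_hard, rewards_hard = self_consistency_rewards(["17", "17", "17", "42"])
```

The second case is the boundary the paragraph describes: agreement among samples stands in for correctness only while the model's most common answer tends to be the right one. Past that point the reward is high and uninformative.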

What sharpens this is that even *external* verifiable rewards don't buy new capability, so intrinsic ones certainly can't. "Does RLVR actually expand what models can reason about?" shows via pass@k that RLVR narrows sampling toward solutions already in the base model's distribution rather than expanding the set of solvable problems: base models overtake RLVR models at high k. "What does reward learning actually do to model reasoning?" drives it home: a single example can trigger the gains, and *spurious* rewards work nearly as well as correct ones for models with appropriate pretraining. That's the giveaway. If a wrong reward and a right reward produce the same lift, the reward isn't teaching anything; it's activating what pretraining already put there. So an intrinsic reward, which at best approximates a correct external one, is structurally capped at re-sorting what the model already knows. Once a task requires reasoning patterns outside that base distribution, it is beyond what any self-generated signal can reach, and only genuine transfer (distillation) crosses that line.
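
The pass@k comparison can be sketched with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The per-problem counts below are invented purely to illustrate the crossover pattern the note describes; they are not results from any paper, and the helper names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100  # samples per problem (assumed)

# Made-up counts of correct samples on four problems.
# "Base": rarely right, but occasionally right on every problem.
BASE_CORRECT = [5, 5, 5, 5]
# "RLVR": often right on two problems, never right on the other two.
RLVR_CORRECT = [60, 60, 0, 0]

base_k1 = mean_pass_at_k(BASE_CORRECT, N, 1)
rlvr_k1 = mean_pass_at_k(RLVR_CORRECT, N, 1)
base_k64 = mean_pass_at_k(BASE_CORRECT, N, 64)
rlvr_k64 = mean_pass_at_k(RLVR_CORRECT, N, 64)
```

With these toy numbers the sharpened model wins at k = 1 and loses at k = 64, because sharpening concentrated probability on problems it could already solve while the broader base distribution still reaches the others given enough samples.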

The interesting escape hatch is to raise the verification side of the gap rather than accept it as fixed. "Can reward models benefit from reasoning before scoring?" and "Can generative reasoning beat discriminative models with less training data?" both show that letting the judge *reason* before it scores, spending test-time compute on evaluation, lifts the verification ceiling well past what a snap outcome judgment achieves (a 1.5B GenPRM beats GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers with only 1% of the PRM800K labels). That reframes the answer to the question: intrinsic rewards become insufficient not at some absolute capability level but wherever generation has outrun a *non-reasoning* verifier. Make the verifier think, and you push the insufficiency threshold higher.

A few notes complicate the picture in useful ways. "Does binary reward training hurt model calibration?" shows that a crude intrinsic-style signal (binary correctness) doesn't just plateau; it actively pushes the model toward confident guessing, because it never penalizes confident wrong answers. And "Can scalar rewards capture all the information in agent feedback?" argues that a scalar reward throws away the *directive* half of feedback (how to change) and keeps only the *evaluative* half (how good), so part of what makes self-rewarding insufficient is that the reward format itself is lossy, independent of capability. The takeaway a reader might not expect: 'intrinsic rewards run out' is less a wall at a fixed skill level and more a relationship you can renegotiate, whether by anchoring to something external, by making the verifier reason, or by using a richer feedback signal than a single number.
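
The calibration point can be illustrated with a small sketch. The source summary for the calibration note says a Brier score is added as a second reward term; the exact form below (correctness minus the squared error of the stated confidence) is an assumption for illustration, and both function names are hypothetical.

```python
def binary_reward(correct: bool) -> float:
    """Plain correctness reward: ignores the model's stated confidence."""
    return 1.0 if correct else 0.0


def calibrated_reward(correct: bool, confidence: float) -> float:
    """Correctness minus a Brier penalty (assumed form, not from the source).

    confidence is the model's stated probability that its answer is right.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


# A wrong answer earns the same binary reward whether the model was sure or not,
# so nothing discourages confident guessing.
wrong_binary = binary_reward(False)

# With the Brier term, a confident wrong answer costs more than a hesitant one.
wrong_confident = calibrated_reward(False, 0.9)
wrong_hesitant = calibrated_reward(False, 0.1)
```

Under the binary signal, confidence never affects the reward, so guessing at full confidence is never worse than hedging. The added term makes overconfidence on wrong answers strictly costly.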

Sources 7 notes

Can models reliably improve themselves without external feedback?

Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can reward models benefit from reasoning before scoring?

Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.

Can generative reasoning beat discriminative models with less training data?

GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.


Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Reward Reasoning Model (3.39 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (1.78 match)
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (1.78 match)
Spurious Rewards: Rethinking Training Signals in RLVR (1.76 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (1.75 match)
StepWiser: Stepwise Generative Judges for Wiser Reasoning (1.71 match)
Reasoning Language Models: A Blueprint (1.69 match)
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge (1.69 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability-frontier analyst. The question: *At what capability level does the generation-verification gap make intrinsic rewards insufficient?* — still open, especially as verifier reasoning and multi-agent orchestration evolve.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 and center on a core asymmetry: intrinsic rewards plateau wherever a model cannot verify better than it generates.
- Pure self-improvement stalls at the frontier where the verification ceiling equals the generation ceiling; every working method smuggles in an external anchor (frozen checkpoint, third-party judge, user correction, tool ground-truth) (~2024–2025).
- RLVR doesn't expand reasoning beyond base-model distribution; it re-sorts pretraining via pass@k narrowing; spurious rewards work nearly as well as correct ones, suggesting reward type is activating, not teaching (~2025).
- Reasoning verifiers (test-time compute on evaluation) lift the verification ceiling past outcome-judgment: a 1.5B GenPRM outperforms GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using 1% of PRM800K labels, pushing the insufficiency threshold higher (~2025).
- Binary intrinsic signals corrupt calibration (confident guessing); scalar rewards discard directive feedback (how-to-change), retaining only evaluative (how-good), making format itself lossy (~2024–2025).
- Agent next-state signals decompose into evaluative + directive channels; intrinsic rewards capture only the former (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2412.02674 (Mind the Gap; 2024-12)
- arXiv:2504.13837 (Does RLVR expand reasoning; 2025-04)
- arXiv:2505.14674 (Reward Reasoning Model; 2025-05)
- arXiv:2507.14843 (Invisible Leash; 2025-07)

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For every claim above, judge whether newer models (reasoning-heavy chains, o1-style test-time scaling), verifier architectures (multi-step reward reasoning, ensemble judges), orchestration (multi-agent verification pipelines, memory-augmented critics), or evaluation harnesses have since relaxed or overturned it. Separate the durable question (likely: *can intrinsic rewards ever cross a true capability boundary, or only re-sort existing competence?*) from perishable limitations (e.g., *non-reasoning verifiers suffice*). State plainly where each constraint still holds and what would falsify it.
(2) **Surface strongest CONTRADICTING or SUPERSEDING work** from the last ~6 months. Look for papers showing intrinsic rewards *do* enable novel reasoning, or showing verifier-reasoning doesn't materially help, or showing RL itself is becoming less necessary.
(3) **Propose 2 research questions** that assume the regime *has* shifted:
   - If verifier reasoning truly relaxes the insufficiency boundary, does the limit now shift to the *cost* of evaluation rather than the *difficulty* of verification?
   - Do multi-agent verification ensembles (e.g., critic armies) eventually replicate external ground-truth well enough to unplug the external anchor, or is circular verification fundamentally brittle?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
