Why can generative verifiers scale verification compute more effectively than fixed-output discriminative models?
This explores why verifiers that *reason out loud* before judging an answer (generative) get more mileage out of extra compute than verifiers that emit a single fixed score (discriminative).
The short version from the corpus: a generative verifier can spend extra compute on reasoning, while a discriminative one spends everything on a single forward pass that ends in one number, so additional compute has nowhere to go.
The clearest evidence is direct. When process reward models are reframed as a generative task (write a chain of thought, *then* judge), they beat discriminative scorers with far fewer training labels: a 1.5B GenPRM outscores GPT-4o, and ThinkPRM surpasses full-dataset discriminative verifiers using just 1% of the PRM800K labels (see: Can generative reasoning beat discriminative models with less training data?). This scales because the verification step becomes a reasoning step, and reasoning is exactly where extra inference tokens pay off. A separate result shows that non-reasoning models can't close the gap with reasoning models no matter how large their inference budget; extra tokens are only productive if the model was trained to use them as a reasoning protocol (see: Can non-reasoning models catch up with more compute?). A fixed-output discriminator is the verification-side version of that non-reasoning model: more compute doesn't buy it more deliberation.
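To make the scaling argument concrete, here is a minimal sketch of one way a generative verifier could turn extra compute into accuracy: sample several reasoning chains and take a majority vote over their verdicts. The independence assumption, the per-chain accuracy of 0.7, and the function names are illustrative assumptions, not details taken from GenPRM or ThinkPRM. A deterministic discriminative scorer has no comparable knob, since rerunning the same forward pass returns the same number.

```python
from math import comb


def majority_verdict(verdicts: list[bool]) -> bool:
    """Aggregate sampled pass/fail verdicts by strict majority (ties count as fail)."""
    return sum(verdicts) * 2 > len(verdicts)


def majority_vote_accuracy(per_chain_accuracy: float, num_chains: int) -> float:
    """Probability that a majority of sampled chains reaches the right verdict.

    Assumes every chain is correct independently with the same probability,
    which real verifier samples only approximate.
    """
    if num_chains < 1 or num_chains % 2 == 0:
        raise ValueError("num_chains must be a positive odd number")
    needed = num_chains // 2 + 1
    r = per_chain_accuracy
    return sum(
        comb(num_chains, i) * r**i * (1 - r) ** (num_chains - i)
        for i in range(needed, num_chains + 1)
    )


if __name__ == "__main__":
    # Hypothetical per-chain accuracy; more chains means more verification compute.
    for k in (1, 3, 9, 27):
        print(k, round(majority_vote_accuracy(0.7, k), 4))
```

Under these assumptions the vote only helps when each chain is right more often than wrong; below 0.5 per-chain accuracy, sampling more chains makes the aggregate worse.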
There's a deeper structural reason this matters, and it's the most interesting thread here. Self-improvement in language models is formally bounded by the *generation-verification gap*: a model can only reliably improve as far as something can verify its outputs (see: What stops large language models from improving themselves?). If your verifier is a frozen discriminator, the ceiling is fixed wherever that discriminator tops out. A verifier that can itself reason, and that scales by thinking harder on hard cases, is a moving ceiling, which is why so much recent work pours compute into the verification side rather than the generation side.
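The ceiling can be illustrated with a toy model. This is an assumption made for illustration, not the formal result from the note cited above. Suppose a generator is correct with probability `p`, a verifier accepts correct answers with probability `tpr` and incorrect ones with probability `fpr`, and we return the first accepted answer out of `n` samples, falling back to the first sample if nothing is accepted. As `n` grows, accuracy converges to a limit set by the verifier's error rates, not by how many samples the generator draws.

```python
def selection_accuracy(p: float, tpr: float, fpr: float, n: int) -> float:
    """Accuracy of 'return the first verifier-accepted sample out of n'.

    p   : probability a single generated answer is correct
    tpr : probability the verifier accepts a correct answer
    fpr : probability the verifier accepts an incorrect answer
    Falls back to the first sample when all n candidates are rejected.
    """
    accept = p * tpr + (1 - p) * fpr          # chance any one sample is accepted
    reject = 1 - accept
    none_accepted = reject ** n               # all n samples rejected
    accepted_correct = p * tpr / accept if accept > 0 else 0.0
    fallback_correct = p * (1 - tpr) / reject if reject > 0 else 0.0
    return (1 - none_accepted) * accepted_correct + none_accepted * fallback_correct


def verifier_ceiling(p: float, tpr: float, fpr: float) -> float:
    """Limit of selection_accuracy as n grows: P(correct | accepted)."""
    accept = p * tpr + (1 - p) * fpr
    return p * tpr / accept if accept > 0 else p


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, round(selection_accuracy(0.3, 0.9, 0.3, n), 4))
    print("ceiling:", verifier_ceiling(0.3, 0.9, 0.3))
```

In this model, lowering `fpr` (a better verifier) raises the limit, while raising `n` only moves accuracy toward a limit that stays where it was.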
The corpus also shows this isn't only about adding chain-of-thought. Generative verification can be made adversarial: in RARO, a critic that learns to discriminate expert from policy answers replaces hand-built task-specific verifiers entirely while matching the scaling properties of verifier-based RL (see: Can adversarial critics replace task-specific verifiers for reasoning?). It can be reframed as a game in which a generator and a discriminator negotiate to equilibrium, letting 7B models match 540B-model performance with no fine-tuning (see: Can generative and discriminative models reach agreement?). And it can run *asynchronously alongside* generation, forking off to check verifiable state and intervening only on violations, so the extra verification adds almost no latency on correct runs (see: Can verifiers monitor reasoning without slowing generation down?).
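The "intervene only on violations" pattern from the last mechanism above can be sketched as a monitor that walks a reasoning trace, extracts whatever state is checkable, and reports the first step that fails. Everything here is hypothetical: the function names, the arithmetic-only extractor, and the sequential loop, which stands in for a check that a real system would run concurrently with generation. It does not reproduce any cited system's API.

```python
import re
from typing import Callable, Optional, Sequence, Tuple, TypeVar

State = TypeVar("State")


def first_violation(
    steps: Sequence[str],
    extract_state: Callable[[str], Optional[State]],
    is_valid: Callable[[State], bool],
) -> Optional[int]:
    """Return the index of the first step whose extracted state fails the check.

    Steps with nothing verifiable are skipped, so a correct trace passes
    through untouched and the function returns None.
    """
    for index, step in enumerate(steps):
        state = extract_state(step)
        if state is None:
            continue
        if not is_valid(state):
            return index
    return None


# Illustrative extractor: finds claims of the form "a op b = c" in a step.
_ARITHMETIC = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
_OPERATIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}

ArithmeticClaim = Tuple[int, str, int, int]


def extract_arithmetic(step: str) -> Optional[ArithmeticClaim]:
    match = _ARITHMETIC.search(step)
    if match is None:
        return None
    left, op, right, claimed = match.groups()
    return int(left), op, int(right), int(claimed)


def arithmetic_holds(claim: ArithmeticClaim) -> bool:
    left, op, right, claimed = claim
    return _OPERATIONS[op](left, right) == claimed


if __name__ == "__main__":
    trace = ["First, 12 + 30 = 42.", "So we need twice that.", "Then 42 * 2 = 86."]
    print(first_violation(trace, extract_arithmetic, arithmetic_holds))
```

The design choice worth noting is that the monitor returns a position instead of a score, which lets a caller resume or repair generation from the failing step.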
The thing you might not have known you wanted to know: the most reliable verifiers in this corpus aren't neural scorers at all. Generative reasoning can be used to *write* formal checkers, auto-synthesizing provably correct Lean and z3 verifiers from prose policy documents (see: Can we automatically generate formal verifiers from policy text?). That's the logical endpoint of the generative advantage: a discriminator can only output a verdict, but a verifier that reasons can output an entire verification *program*, and a formally checked program replaces a probabilistic score with a guarantee for the properties it encodes.
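To show the difference in the *shape* of the output, here is a minimal sketch of a verifier expressed as a program. The refund policy, its thresholds, and all names are hypothetical. The checker is hand-written plain Python, so it is neither synthesized from prose nor provably correct in the way the Lean and z3 checkers described above are. What it shares with them is that it is deterministic, reusable, and names the rule that failed.

```python
from dataclasses import dataclass

# Hypothetical prose policy being encoded:
#   "Refunds are allowed within 30 days of purchase, for at most 500.00,
#    and only for unopened items."
MAX_DAYS_SINCE_PURCHASE = 30
MAX_REFUND_AMOUNT = 500.0


@dataclass(frozen=True)
class RefundRequest:
    days_since_purchase: int
    amount: float
    item_opened: bool


def policy_violations(request: RefundRequest) -> list[str]:
    """Return every policy rule the request breaks; an empty list means it passes."""
    violations = []
    if request.days_since_purchase > MAX_DAYS_SINCE_PURCHASE:
        violations.append("refund requested more than 30 days after purchase")
    if request.amount > MAX_REFUND_AMOUNT:
        violations.append("refund amount exceeds 500.00")
    if request.item_opened:
        violations.append("item has been opened")
    return violations


if __name__ == "__main__":
    print(policy_violations(RefundRequest(45, 120.0, False)))
    print(policy_violations(RefundRequest(10, 120.0, False)))
```

A scalar score could say that the first request looks bad; the program says which clause it breaks, and gives the same answer every time it runs.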
Sources 7 notes
GenPRM and ThinkPRM reframe process supervision as generative tasks with CoT reasoning before judgment, achieving superior performance on far fewer labels. A 1.5B GenPRM beats GPT-4o; ThinkPRM uses only 1% of PRM800K labels to surpass full-dataset discriminative verifiers.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.
The Consensus Game frames decoding as a signaling game where generator and discriminator must agree on answers. Equilibrium-Ranking finds their joint policy, enabling 7B models to match 540B model performance without fine-tuning.
Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.
interwhen automatically generates code-based verifiers—including provably correct Lean and z3 checkers—from prose policy documents. This inverts the usual neuro-symbolic division: the LLM both translates policy to formal logic and extracts verifier inputs from reasoning traces.