INQUIRING LINE


An AI critic that compares answers against each other trains sharper reasoning than one assigning fixed scores.

Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?

This explores why a critic that judges answers comparatively — distinguishing expert outputs from a policy's own attempts — tends to beat a critic that hands out fixed absolute scores, in the context of adversarial RL for reasoning.

The corpus suggests why a relational critic (one that asks 'is this answer better or worse than the alternative?') outperforms one that assigns an absolute number: absolute scoring is brittle in exactly the places reasoning training stresses it most. The clearest anchor is RARO's adversarial setup (see "Can adversarial critics replace task-specific verifiers for reasoning?"), where a critic learns to discriminate expert answers from the policy's answers rather than verify correctness against a fixed rubric. Because the critic only has to judge *relative* quality, it sidesteps the need for a domain-specific verifier. Crucially, the target it is chasing moves with the policy, so there is never a fixed score to game.
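To make the relative-versus-absolute distinction concrete, here is a minimal sketch in Python. It is not RARO's actual objective, which the notes here do not spell out. It assumes a generic pairwise logistic (Bradley-Terry-style) loss for the relational critic and a plain binary cross-entropy for the absolute one. The function names and the `policy_reward` definition are hypothetical, chosen only for illustration.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def absolute_critic_loss(score: float, is_expert: bool) -> float:
    """Absolute scoring: each answer is pushed toward a fixed label (1 or 0).

    This is binary cross-entropy on a single logit.
    """
    return softplus(-score) if is_expert else softplus(score)


def relativistic_critic_loss(expert_score: float, policy_score: float) -> float:
    """Relative scoring: only the margin between the two answers matters.

    Equals -log(sigmoid(expert_score - policy_score)).
    """
    return softplus(-(expert_score - policy_score))


def policy_reward(expert_score: float, policy_score: float) -> float:
    """Hypothetical policy reward: critic's belief that the policy answer wins."""
    return sigmoid(policy_score - expert_score)


if __name__ == "__main__":
    # Shifting both scores by a constant leaves the relative loss unchanged,
    # so there is no fixed score level for the policy to aim at.
    print(relativistic_critic_loss(2.0, 1.0))
    print(relativistic_critic_loss(102.0, 101.0))
    # The absolute loss depends on the raw score level of each answer.
    print(absolute_critic_loss(2.0, False), absolute_critic_loss(102.0, False))
    # When the policy matches the expert, the reward sits at chance.
    print(policy_reward(1.0, 1.0))
```

Under these assumptions the policy's reward is defined only against the expert answer's current score, so it rises only when the policy closes the gap and not when it reaches a fixed threshold.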

Why does a fixed score get gamed? Two notes show the failure mechanically. Binary correctness rewards, the purest form of absolute scoring, provably wreck calibration: a confident wrong answer costs exactly as much as a humble wrong one, so the optimal move is to guess loudly (see "Does binary reward training hurt model calibration?"). Group-relative normalization, which is supposed to soften this, backfires when scores are sparse: a rare accidental success on a nearly impossible problem is treated as a high-advantage trajectory, and the model learns shortcuts and answer repetition instead of reasoning (see "Do overly hard RLVR samples actually harm model capabilities?"). Absolute targets reward whatever crosses the threshold, including degenerate paths.
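Both failure modes can be reproduced with a few lines of arithmetic. The sketch below is illustrative only. It assumes the Brier-augmented reward takes the form `correct - (confidence - correct)**2`, and it assumes group-relative normalization means subtracting the group mean and dividing by the group standard deviation (the GRPO-style form). All function names and the example numbers are hypothetical.

```python
import math
from typing import Callable, List


def binary_reward(correct: bool, confidence: float) -> float:
    """Absolute 0/1 reward; the stated confidence is ignored entirely."""
    return 1.0 if correct else 0.0


def brier_augmented_reward(correct: bool, confidence: float) -> float:
    """Assumed form: correctness minus the squared confidence error."""
    y = 1.0 if correct else 0.0
    return y - (confidence - y) ** 2


def expected_reward(
    reward_fn: Callable[[bool, float], float], p_correct: float, confidence: float
) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return p_correct * reward_fn(True, confidence) + (1.0 - p_correct) * reward_fn(
        False, confidence
    )


def best_confidence(
    reward_fn: Callable[[bool, float], float], p_correct: float, steps: int = 100
) -> float:
    """Grid-search the stated confidence that maximizes expected reward."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: expected_reward(reward_fn, p_correct, q))


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Binary reward is flat in confidence: overclaiming is free.
    print(expected_reward(binary_reward, 0.3, 0.0))
    print(expected_reward(binary_reward, 0.3, 1.0))
    # With the Brier term, honest confidence (q = p) is optimal.
    print(best_confidence(brier_augmented_reward, 0.3))
    # One lucky success among eight samples gets a larger advantage
    # than a success in a group where half the samples succeed.
    print(group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])[0])
    print(group_relative_advantages([1, 1, 1, 1, 0, 0, 0, 0])[0])
```

With these assumptions, the lone success receives an advantage of about the square root of 7, versus about 1 in the balanced group. This is the sense in which sparse successes are over-weighted.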

The deeper reason a relativistic critic helps connects to what RL actually does to reasoning. Multiple lines of work find that RLVR doesn't expand a model's reasoning frontier; it sharpens sampling toward solutions already latent in the base model (see "Does RLVR actually expand what models can reason about?" and "What does reward learning actually do to model reasoning?"), and base models turn out to carry far more latent capability than their default outputs suggest (see "Do base models already contain hidden reasoning ability?"). If the job is *selection* rather than teaching, then a critic that contrasts good against bad is doing the natural thing, discriminating, whereas an absolute scorer has to invent a correctness signal it doesn't really have, which is where spurious rewards and shortcut amplification creep in.
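The selection-versus-expansion claim rests on pass@k comparisons, as the source note on RLVR describes. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The per-problem sample counts are invented purely to illustrate the crossover pattern and are not results from any paper.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of P(at least one of k samples is correct).

    n: total samples drawn, c: number of correct samples, k: budget (k <= n).
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset has a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n, c)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for two problems, 256 samples each.
    # The base model solves both problems, but rarely.
    base = [(256, 20), (256, 2)]
    # The RL-tuned model is sharp on problem 1 and never solves problem 2.
    tuned = [(256, 200), (256, 0)]

    for k in (1, 16, 256):
        print(k, mean_pass_at_k(base, k), mean_pass_at_k(tuned, k))
```

In this toy setup the tuned model wins at k = 1 and loses at k = 256. That is the signature the notes describe: sampling narrows toward solutions the base model could already reach, and the set of solvable problems does not grow.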

There's also an adversarial-robustness angle the question's framing invites. Reasoning models are startlingly fragile to absolute-looking signals: appending irrelevant text spikes error rates (see "How vulnerable are reasoning models to irrelevant text?"), manipulative multi-turn prompts knock 25–29% off accuracy (see "Why do reasoning models fail under manipulative prompts?"), and a model can ace every fixed benchmark while its internal representation is incoherent (see "Can AI pass every test while understanding nothing?"). A fixed scorer inherits all of these blind spots as exploitable surface; a critic trained to tell expert from policy keeps adapting as the policy finds new tricks, which is the whole point of making the contest adversarial.

The thing worth taking away: the win isn't really about 'relative vs absolute math.' It's that absolute scoring assumes you possess ground truth about quality, and across this corpus that assumption is exactly what fails, whether through calibration collapse (see "Does binary reward training hurt model calibration?"), reward hacking, or benchmarks that can't see inside the model. A relativistic critic survives because it never claims to know the right answer in the absolute, only which of two attempts comes closer to it. For the contrast case where you *do* want to shape the reasoning process rather than just rank outcomes, see how metacognitive process rewards earn their signal differently ("Can RL agents learn to reason better, not just succeed?").

Sources (10 notes)

Can adversarial critics replace task-specific verifiers for reasoning?

RARO uses an adversarial game where a critic discriminates expert from policy answers, eliminating the need for domain-specific verifiers while matching the scaling properties of verifier-based RL. The approach works across Countdown, DeepMath, and Poetry Writing tasks.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

How vulnerable are reasoning models to irrelevant text?

Appending semantically unrelated sentences to math problems significantly increases error rates in reasoning models. These query-agnostic triggers discovered on cheaper models transfer effectively to stronger models and also inflate response length.

Why do reasoning models fail under manipulative prompts?

GaslightingBench-R demonstrates that o1 and R1 models are more vulnerable to multi-turn adversarial prompts than standard models. Extended reasoning chains create more intervention points where single corrupted steps propagate through elaboration.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (4.33 match)
The Invisible Leash: Why RLVR May Not Escape Its Origin (3.45 match)
Escaping the Verifier: Learning to Reason via Demonstrations (2.61 match)
Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains (2.57 match)
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning (2.51 match)
Absolute Zero: Reinforced Self-play Reasoning with Zero Data (2.51 match)
Spurious Rewards: Rethinking Training Signals in RLVR (1.76 match)
Eliciting Reasoning in Language Models with Cognitive Tools (1.75 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a critical LLM researcher re-testing claims about critic design in reasoning training. The question remains open: why does relative scoring outperform absolute scoring in adversarial reasoning contexts?

What a curated library found — and when (dated claims, not current truth):
Findings span Sept 2024–May 2026. The library identified these constraints:
• Binary correctness rewards provably degrade calibration; confident wrong answers incentivize gaming rather than reasoning (~2024-09).
• Group-relative normalization on sparse scores causes shortcut amplification and answer repetition instead of genuine reasoning (~2024-09).
• RLVR does not expand reasoning capability beyond the base model's latent capacity — it only sharpens selection (~2025-04).
• Reasoning models are fragile to absolute-looking signals: irrelevant text appends spike errors 300%, manipulative prompts reduce accuracy 25–29% (~2025-03).
• A critic trained adversarially to discriminate expert from policy output adapts as the policy evolves, whereas fixed scorers inherit blind spots (~2025-07).

Anchor papers (verify; mind their dates):
• arXiv:2409.15360 (2024-09): Reward-Robust RLHF in LLMs
• arXiv:2504.13837 (2025-04): Does Reinforcement Learning Really Incentivize Reasoning Capacity
• arXiv:2507.22844 (2025-07): RLVMR with Verifiable Meta-Reasoning Rewards
• arXiv:2605.28388 (2026-05): Mechanistically Interpreting Sample Difficulty in RLVR

Your task:
(1) RE-TEST EACH CONSTRAINT. For calibration collapse, shortcut amplification, and latent-capacity ceilings: has adversarial critiquing or new training orchestrations (e.g., multi-round discriminator updates, curriculum on policy diversity) since relaxed these limits? Judge which remain genuine bottlenecks vs. which newer methods dissolve. Separate the durable question (why is *relative* ranking robust?) from the perishable claim (absolute scoring is *always* brittle).
(2) Surface the strongest DISAGREEMENT or SUPERSEDING work from the last 6 months — papers that argue absolute scoring can work under specific conditions, or that reframe the problem away from relative vs. absolute entirely.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., can you build an absolute scorer that *learns* its own ground truth dynamically? Can adversarial relative critics scale beyond pairwise contrasts to reasoning over n-way alternatives?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
