Can adversarial critics replace task-specific verifiers for reasoning?
Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
RL for reasoning has a fundamental limitation: reinforcement learning with verifiable rewards (RLVR) requires task-specific verifiers (math checkers, code test suites) that don't exist for many reasoning-intensive domains. Expert demonstrations are abundant (Stack Exchange answers, domain expert explanations), but supervised fine-tuning (SFT) on demonstrations doesn't produce the reasoning behaviors that large-scale RL training elicits. RARO bridges this gap with inverse reinforcement learning: instead of being handed a reward function, it learns the reward signal from the demonstrations themselves.
The mechanism is an adversarial game. A policy learns to produce expert-like answers via explicit chain-of-thought (CoT) reasoning. A relativistic critic learns to discriminate between expert and policy answers via pairwise comparison. Both are trained jointly and continuously via RL, which requires careful stabilization techniques. The critic's discrimination signal serves as the reward for the policy: once the critic can no longer tell policy answers from expert answers, the policy is producing answers that are, as far as that critic can judge, expert-level.
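The reward flow in this game can be sketched in a few lines. This is a minimal illustration, not RARO's actual implementation: the zero-sum 0/1 rewards, the random slot shuffling, and the names `Rewards`, `pairwise_rewards`, and `play_round` are all assumptions made for the example, and the critic is any callable that returns the slot it believes holds the expert answer.

```python
import random
from dataclasses import dataclass


@dataclass
class Rewards:
    policy: float  # reward credited to the policy's rollout
    critic: float  # reward credited to the critic's rollout


def pairwise_rewards(critic_pick: int, expert_slot: int) -> Rewards:
    """Score one round (assumed zero-sum 0/1 scheme, for illustration only).

    critic_pick: slot (0 or 1) the critic labelled as the expert answer.
    expert_slot: slot that actually held the expert answer.
    """
    critic_correct = critic_pick == expert_slot
    # The policy is rewarded only when it fools the critic.
    return Rewards(policy=0.0 if critic_correct else 1.0,
                   critic=1.0 if critic_correct else 0.0)


def play_round(policy_answer, expert_answer, critic, rng):
    """Show both answers to the critic in random order and score the round."""
    # Shuffle slots so the critic cannot exploit answer position.
    expert_slot = rng.randrange(2)
    if expert_slot == 0:
        pair = (expert_answer, policy_answer)
    else:
        pair = (policy_answer, expert_answer)
    pick = critic(pair[0], pair[1])  # critic returns 0 or 1
    return pairwise_rewards(pick, expert_slot)


# Toy critic (hypothetical): assumes the longer answer is the expert's.
def longer_is_expert(a, b):
    return 0 if len(a) >= len(b) else 1


rng = random.Random(0)
print(play_round("4", "2 + 2 = 4 because addition is ...", longer_is_expert, rng))
```

Under this scheme, a critic reduced to guessing gives the policy an average reward of 0.5, which is the equilibrium the paragraph above describes. In real training both players would be language models updated by RL on these rewards.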
The reported results: RARO outperforms strong verifier-free baselines on Countdown, DeepMath, and Poetry Writing, and shows the same robust scaling trends as RL with verifiers. This suggests that the scaling properties of RLVR are not specific to verifiable rewards. They emerge from the RL training dynamics themselves, with the adversarial critic providing a sufficient substitute for ground-truth verification.
This extends the frontier of RL for reasoning to any domain with expert demonstrations. RARO relies on a mechanism similar to the one in Does critiquing errors teach deeper understanding than imitating correct answers? Adversarial training pushes the model to develop genuine reasoning rather than surface-level imitation, because the critic learns to distinguish superficial pattern matching from actual expert-like problem solving and stops rewarding the former.
VeriFree as a second verifier-free approach: VeriFree takes a different route to the same goal of extending R1-Zero-style RL training to domains without rule-based verifiers. Instead of training an adversarial critic, the policy generates only the reasoning trace, which is concatenated with the reference answer; the model then evaluates the likelihood of the reference answer conditioned on the question and the generated trace. This likelihood serves as both a reward signal for policy gradients on the reasoning trace and a weighting term for supervised training on the reference answer. VeriFree is architecturally simpler than RARO (no adversarial game) and eliminates the need for even a model-based verifier, reducing compute overhead. See Can reasoning improvement work without answer verification? The two approaches bracket the design space: RARO uses adversarial dynamics for a richer signal, while VeriFree uses reference-conditioned likelihood for simplicity.
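The double role of the reference-answer likelihood can be made concrete with a small sketch. It assumes the per-token log-probabilities have already been computed by the model; the function names are hypothetical, and refinements such as baselines for variance reduction are omitted. It is one reading of the description above, not VeriFree's actual code.

```python
import math


def reference_likelihood(answer_token_logprobs):
    """p(reference answer | question, trace), from per-token log-probs."""
    return math.exp(sum(answer_token_logprobs))


def surrogate_terms(trace_logprob, answer_token_logprobs):
    """Return (reward, surrogate) for one sampled reasoning trace.

    trace_logprob: log p(trace | question) under the policy.
    The reward is treated as a constant (no gradient flows through it).
    Differentiating the surrogate would then give
        reward * grad log p(trace | question)          # policy-gradient term
      + reward * grad log p(answer | question, trace)  # weighted supervised term
    """
    reward = reference_likelihood(answer_token_logprobs)
    answer_logprob = sum(answer_token_logprobs)
    surrogate = reward * (trace_logprob + answer_logprob)
    return reward, surrogate


# Toy numbers: a two-token reference answer, each token with probability 0.5.
reward, surrogate = surrogate_terms(-2.0, [math.log(0.5), math.log(0.5)])
print(reward, surrogate)
```

A trace that makes the reference answer more likely earns a larger reward and also gets a larger weight on its supervised term, so no separate verifier model has to score the output.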
Inquiring lines that use this note as a source (30)
This note is a source for these synthesized inquiries.
- Does verification of AI outputs face the same circularity problem?
- How does low verifiability change what we can measure in AI work?
- Why does a relativistic critic outperform absolute scoring in adversarial reasoning training?
- What stability techniques prevent collapse in policy-critic adversarial training?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- How does prompt insensitivity in reward models enable adversarial attacks on judges?
- Can judges trained on both verifiable and non-verifiable tasks transfer across domains?
- What infrastructure could replace search for verifying AI outputs?
- Does training on self-play disagreement data improve multi-agent reasoning outcomes?
- Can architectural changes like adversarial agent roles prevent silent agreement?
- What alternatives to RLHF better preserve truth-seeking in AI outputs?
- Can adversarial critics force genuine reasoning the same way critique fine-tuning does?
- What role do verifiers play in stabilizing extended reasoning at test time?
- How should humans specify deterministic abstractions of RL problems?
- Can automated tools close the gap between AI generation and verification?
- How does adversarial collapse threaten unsupervised self-play skill construction?
- Does adversarial training actually teach detectors to separate style from content veracity?
- Why can generative verifiers scale verification compute more effectively than fixed-output discriminative models?
- Can verification tools keep pace with AI artifact generation speed?
- Can verifier output replace ground-truth answers as the asymmetric information source?
- How do verifier-free and adversarial approaches compare in extending reasoning RL?
- Why does adversarial training force deeper reasoning than surface imitation?
- Can verifier-based objectives preserve reasoning transparency alongside correctness?
- How can verifiers check policy compliance in agentic reasoning tasks?
- How do adversarial IRL and policy discrimination differ in rejecting preference labels?
- Can verifier-free RL work without manual preference labels or task-specific training?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Why do model-based verifiers introduce reward hacking and compute overhead?
- Can approximate or noisy reference answers work for RL-based reasoning training?
Related concepts in this collection (6)
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  both reject the labeled-preference bottleneck via discrimination: RARO uses adversarial discrimination against demonstrations; POLAR uses similarity-to-target-policy. Different mechanisms but same anti-labeled-preference move
- Can language models replace reward models with internal signals?
  Recent RL research shows three independent patterns (self-judgment, belief-shift, and rich feedback) that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?
  RARO's adversarial-IRL approach forms a fifth substitutable pattern alongside SERL, ΔBelief-RL, SDPO, and POLAR; each replaces a different RLHF/RLVR component
- Can reasoning improvement work without answer verification?
  Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.
  complementary verifier-free approach: reference likelihood instead of adversarial critic
- Does critiquing errors teach deeper understanding than imitating correct answers?
  Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
  parallel mechanism: adversarial discrimination forces deeper understanding than pure imitation
- Can simple rewards alone teach complex domain reasoning?
  Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what domain AI models need to learn effectively.
  extends: the "simple objective" can be adversarial critic score, not just verifiable correctness
- Can self-supervised process rewards replace human annotation?
  Self-supervised PRMs learn from outcome labels alone, avoiding expensive step-level annotation. The key question is whether this approach generalizes beyond math and code to domains with ambiguous correctness.
  related: both reduce dependence on external annotation for RL training
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Escaping the Verifier: Learning to Reason via Demonstrations
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- Reinforcing General Reasoning without Verifiers
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Original note title
inverse rl from demonstrations enables reasoning training without task-specific verifiers