Can reasoning emerge from expert demonstrations alone?
Can AI systems learn to reason about non-verifiable tasks by studying expert examples rather than explicit reward signals? This matters because many high-value domains like medicine and law have abundant demonstrations but no automated verifiers.
Reinforcement learning with verifiable rewards (RLVR) depends on an automated verifier to score each output. Many real-world reasoning tasks lack verifiers but have abundant expert demonstrations (Stack Exchange answers, medical case notes, legal analyses). RARO (Relativistic Adversarial Reasoning Optimization) bridges this gap through Inverse Reinforcement Learning: instead of requiring a hand-defined reward function, it recovers one from expert behavior.
The framework sets up an adversarial game between two co-trained components:
- A reasoning policy that learns to produce expert-level answers via explicit Chain-of-Thought reasoning
- A relativistic critic that learns to discriminate between expert and policy answers via pairwise comparison
Both are trained jointly and continuously via RL. The policy improves at producing expert-like outputs; the critic improves at distinguishing them. The adversarial dynamic creates an implicit reward function grounded in expert demonstrations rather than explicit rules.
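The game described above can be sketched as a single round of reward assignment. This is a toy illustration, not RARO's actual implementation: the `Critic` signature, the `play_round` helper, the binary "A"/"B" verdict, and the 0/1 rewards are all assumptions made for the example. In practice both the policy and the critic would be language models updated by an RL algorithm using rewards of this kind.

```python
import random
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: the critic sees a question and two answers and
# returns "A" or "B", naming the answer it believes came from the expert.
Critic = Callable[[str, str, str], str]


@dataclass
class RoundResult:
    policy_reward: float  # 1.0 when the critic mistakes the policy answer for the expert's
    critic_reward: float  # 1.0 when the critic correctly identifies the expert answer
    expert_slot: str      # which slot ("A" or "B") held the expert answer


def play_round(question: str, expert_answer: str, policy_answer: str,
               critic: Critic, rng: random.Random) -> RoundResult:
    # Randomize presentation order so the critic cannot exploit position.
    if rng.random() < 0.5:
        answer_a, answer_b, expert_slot = expert_answer, policy_answer, "A"
    else:
        answer_a, answer_b, expert_slot = policy_answer, expert_answer, "B"

    guess = critic(question, answer_a, answer_b)
    if guess not in ("A", "B"):
        raise ValueError(f"critic must return 'A' or 'B', got {guess!r}")

    # Zero-sum assumption: the policy is rewarded exactly when the critic is fooled.
    critic_correct = guess == expert_slot
    return RoundResult(
        policy_reward=0.0 if critic_correct else 1.0,
        critic_reward=1.0 if critic_correct else 0.0,
        expert_slot=expert_slot,
    )


def longer_answer_critic(question: str, answer_a: str, answer_b: str) -> str:
    # Stand-in critic for the demo: guesses that the longer answer is the expert's.
    return "A" if len(answer_a) >= len(answer_b) else "B"


if __name__ == "__main__":
    result = play_round(
        "What is 2 + 2?",
        expert_answer="2 + 2 = 4, because adding two pairs gives four items.",
        policy_answer="4",
        critic=longer_answer_critic,
        rng=random.Random(0),
    )
    print(result)
```

No verifier appears anywhere in the loop: the only supervision is which answer came from the expert, which is how the reward stays grounded in demonstrations.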
RARO significantly outperforms verifier-free baselines on Countdown, DeepMath, and Poetry Writing, and shows the same robust scaling trends as RL with verifiers. This indicates that strong reasoning can emerge from demonstrations alone: the verifier is not a prerequisite for RL-trained reasoning, just the most convenient reward source.
Stabilization is essential because naive adversarial training is notoriously unstable. Two techniques are required for robust learning: the "relativistic" critic, which compares pairs of answers rather than assigning absolute scores, and careful training choreography.
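One way to see why pairwise comparison can be steadier than absolute scoring is a small numerical sketch. It borrows the scalar-score form used in relativistic GAN formulations, which is an assumption here: the note does not say whether RARO's critic produces scalar scores at all, and the function names below are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def absolute_reward(policy_score: float) -> float:
    # Absolute scoring: the reward depends only on the critic's raw score.
    return sigmoid(policy_score)


def relativistic_reward(policy_score: float, expert_score: float) -> float:
    # Pairwise scoring: the reward depends on the gap to the expert's score.
    return sigmoid(policy_score - expert_score)


if __name__ == "__main__":
    # Suppose the critic's score scale drifts upward by 5 during training.
    before = {"policy": 1.0, "expert": 2.0}
    after = {"policy": 6.0, "expert": 7.0}

    print("absolute:", absolute_reward(before["policy"]), absolute_reward(after["policy"]))
    print("relative:",
          relativistic_reward(before["policy"], before["expert"]),
          relativistic_reward(after["policy"], after["expert"]))
```

Under these assumptions, a uniform drift in the critic's score scale changes the absolute reward but leaves the pairwise reward untouched, since only the gap between the policy and expert answers matters.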
In relation to "Can adversarial critics replace task-specific verifiers for reasoning?", RARO provides the full implementation and stability analysis. In relation to "What limits how much models can improve themselves?", RARO partially circumvents the generation-verification bound discussed there: the critic co-evolves with the policy rather than remaining static, though the expert demonstrations set an ultimate quality ceiling.
The practical implication: domains rich in expert examples but lacking automated verification (medical reasoning, legal analysis, scientific writing) can now benefit from RL-trained reasoning, which was previously practical mainly in verifiable domains such as math and code.
Inquiring lines that use this note as a source 11
This note is a source for these synthesized inquiries.
- Can AI output be verified without understanding the reasoning behind it?
- Can diverse expert demonstrations exceed the knowledge of any single expert?
- How does low verifiability change what we can measure in AI work?
- How does the expert demonstration ceiling compare to the generation-verification gap bound?
- What alternatives exist when required knowledge is absent from training?
- Can judges trained on both verifiable and non-verifiable tasks transfer across domains?
- How does correctness emergence occur when no expert initially solved the task?
- Can artificial systems develop the authority to challenge expert claims?
- What implicit warrants do expert arguments rely on that AI cannot reliably access?
- Does the 78-demonstration principle apply to other AI capabilities beyond agency?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
Related concepts in this collection 4
- Can adversarial critics replace task-specific verifiers for reasoning?
  Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
  RARO is the full adversarial implementation
- Why do self-improvement loops eventually stop improving?
  Self-improvement systems often plateau because the evaluator that judges progress stays static while the actor grows. What happens when judges don't improve alongside learners?
  RARO's co-trained critic operationalizes this principle
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  expert demonstrations set the ceiling rather than generation-verification gap
- Does critiquing errors teach deeper understanding than imitating correct answers?
  Can training models to critique flawed responses build better structural understanding than standard supervised fine-tuning on correct answers? This matters because it reveals whether deep reasoning requires engaging with failure modes rather than pattern matching.
  the critic component learns evaluation through adversarial training
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Escaping the Verifier: Learning to Reason via Demonstrations
- The Invisible Leash: Why RLVR May Not Escape Its Origin
- Reinforcing General Reasoning without Verifiers
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
Original note title
inverse rl from expert demonstrations enables reasoning in non-verifiable domains through adversarial policy-critic co-training