Can reasoning improvement work without answer verification?
Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.
DeepSeek-R1-Zero-style RL training has produced remarkable gains in math and code — but only because those domains have rule-based verifiers (answer checking, test cases). Extending this paradigm to chemistry, healthcare, law, biology, and economics has been blocked by the answer verification requirement. Model-based verifiers (using an LLM to check answers) are the standard workaround, but they introduce reward hacking vulnerability, depend on a strong verifier LLM, and add significant compute overhead from maintaining the verifier in memory.
VeriFree (2025) offers a structurally different solution: skip verification entirely. Given a question, the model generates only the reasoning trace, which is then concatenated with the reference answer from the dataset. The likelihood of the reference answer, conditioned on the question and the generated reasoning trace, serves two purposes: (1) a reward signal for policy gradients on the reasoning trace, and (2) a weighting term for supervised training on the reference answer.
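The dual role of the likelihood can be read off a short derivation. This is a sketch under an assumption the note does not state explicitly: that the training objective is the expected likelihood of the reference answer under reasoning traces sampled from the policy. The symbols $x$ (question), $z$ (reasoning trace), $y^{*}$ (reference answer), and $\pi_\theta$ (policy) are notation introduced here for illustration.

```latex
\begin{align}
J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
     \big[\pi_\theta(y^{*} \mid x, z)\big] \\
\nabla_\theta J(\theta)
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\Big[
     \pi_\theta(y^{*} \mid x, z)\,\nabla_\theta \log \pi_\theta(z \mid x)
     + \nabla_\theta \pi_\theta(y^{*} \mid x, z)\Big] \\
  &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\Big[
     \underbrace{\pi_\theta(y^{*} \mid x, z)}_{\text{reward}}\,
       \nabla_\theta \log \pi_\theta(z \mid x)
     + \underbrace{\pi_\theta(y^{*} \mid x, z)}_{\text{weight}}\,
       \nabla_\theta \log \pi_\theta(y^{*} \mid x, z)\Big]
\end{align}
```

The first term is a policy-gradient update on the trace with the reference-answer likelihood as its reward. The second is a supervised update on the reference answer, weighted by that same likelihood. The last step uses only the identity $\nabla_\theta f = f\,\nabla_\theta \log f$.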
The intuition: a good reasoning trace will make the reference answer more likely. If the model reasons correctly about why a molecule has certain properties, the probability of generating the correct molecular description increases. The reasoning trace's quality is measured by how well it "leads to" the known answer — without ever needing to verify whether the model's own generated answer matches.
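A toy illustration of this scoring, as a minimal sketch rather than the VeriFree implementation: it assumes per-token log-probabilities of the reference answer are already available from the model, and the function names and numbers below are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def reference_answer_likelihood(token_logprobs: Sequence[float]) -> float:
    """Likelihood of the full reference answer given question and trace.

    The answer likelihood is the product of its per-token probabilities,
    computed here as exp(sum of log-probabilities).
    """
    return math.exp(sum(token_logprobs))


def rank_traces(
    logprobs_by_trace: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Score each sampled trace by reference-answer likelihood, highest first."""
    scored = [
        (name, reference_answer_likelihood(logprobs))
        for name, logprobs in logprobs_by_trace.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up log-probabilities of the same 3-token reference answer,
# conditioned on two different sampled reasoning traces.
example = {
    "trace_flawed": [math.log(0.3), math.log(0.2), math.log(0.5)],
    "trace_sound": [math.log(0.9), math.log(0.8), math.log(0.95)],
}
ranking = rank_traces(example)
```

With these invented numbers the sound trace scores 0.9 × 0.8 × 0.95 = 0.684 and the flawed trace 0.3 × 0.2 × 0.5 = 0.03, so the sound trace receives the larger reward. No answer generated by the model is ever compared against the reference.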
This connects to two existing notes. Can adversarial critics replace task-specific verifiers for reasoning? (RARO) uses adversarial IRL to learn rewards from demonstrations; VeriFree takes a simpler path, with no learned reward model at all, just the reference answer's conditional probability. And as Does RL teach reasoning or just when to use it? argues, the reasoning capability is already latent; VeriFree provides the reward signal that activates it in domains where verification was previously impossible.
The practical consequence: R1-Zero-style training is no longer limited to math and code. Any domain with reference answers (even approximate or noisy ones) can now use RL for reasoning improvement.
Reweave 2026-05-18: VeriFree is one of five substitutable verifier-free patterns. What looked like an alternative to RARO when this note was written has since resolved into a family of substitutable mechanisms. Can language models replace reward models with internal signals? names the convergence: each pattern replaces a different RLHF/RLVR component without touching the others. VeriFree replaces the verifier with the reference-answer-likelihood signal, which places it in this family alongside SERL (pairwise self-judgment), ΔBelief-RL (internal belief shift), SDPO (rich-feedback self-distillation), and POLAR (similarity-to-target-policy as relational reward). RARO reaches the same goal by a different route, adversarial IRL.
The structural claim that emerges: the reward-signal source is substitutable in much the way RL algorithm choice turned out to be substitutable. Five different verifier-free reward sources converge on similar capability gains because — as Does the choice of RL algorithm actually matter for reasoning? argues at the algorithm level — the binding constraint is the pretrained prior, not the specific source of reward signal. VeriFree's contribution is not that its specific mechanism is uniquely correct but that it confirmed verifier-free reward signals can match verifier-based ones in domains where the verifier was the bottleneck. The fact that four other mechanisms now achieve the same result is consistent with the substitutability thesis, not a refutation of VeriFree's value.
A second consequence of the reweave: VeriFree's design choice — reference-answer-likelihood — is the most general-purpose member of the family because it requires only a reference answer (which most supervised datasets provide). SERL needs pairwise comparability of self-generated responses. ΔBelief-RL needs ground-truth final outcomes during training. SDPO needs rich tokenized environment feedback. POLAR needs a target policy as reference. RARO needs expert demonstrations. VeriFree's requirements are the lightest. This positions it as a default fallback in the verifier-free design space — not necessarily the best, but the most broadly applicable.
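The requirements listed above can be written down as a small lookup, which makes the "default fallback" reading concrete. This is a simplified sketch: the resource labels are hypothetical shorthand for the requirements named in the paragraph, and in practice a method's applicability depends on more than one input.

```python
from typing import Dict, List, Set

# One required input per method, paraphrased from the paragraph above.
# The string labels are illustrative shorthand, not terms from the papers.
REQUIREMENTS: Dict[str, str] = {
    "VeriFree": "reference_answer",
    "SERL": "pairwise_comparable_self_generations",
    "ΔBelief-RL": "ground_truth_final_outcome",
    "SDPO": "rich_tokenized_environment_feedback",
    "POLAR": "target_policy",
    "RARO": "expert_demonstrations",
}


def applicable_methods(available: Set[str]) -> List[str]:
    """Return the methods whose required input is among the available resources."""
    return [
        method
        for method, needed in REQUIREMENTS.items()
        if needed in available
    ]


# A plain supervised dataset typically offers reference answers and little else.
supervised_only = applicable_methods({"reference_answer"})

# Adding expert demonstrations opens a second option.
with_demos = applicable_methods({"reference_answer", "expert_demonstrations"})
```

Under this simplification, a dataset that offers only reference answers leaves VeriFree as the single applicable option, which is the sense in which it is the most broadly applicable rather than necessarily the best.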
Inquiring lines that use this note as a source (44)
This note is a source for these synthesized inquiries.
- Does verification of AI outputs face the same circularity problem?
- What verification methods work for knowledge without stable referents?
- Can beam search and ranking functions evaluate claims without understanding counterarguments?
- Why do current RLVR methods fail to expand reasoning capability beyond base model boundaries?
- What replaces truth-correspondence in probabilistic knowledge representations?
- Can external verifiers replace reasoning trace quality in solution guarantees?
- Can verifier-guided search catch factual errors that reasoning training cannot?
- Can AI evaluation tools solve the verification problem they help create?
- Can reasoning skills trained on law improve performance in STEM?
- Can LLMs improve at simple deduction through different training approaches?
- Does reasoning trace style explain why RL post-training improves model reasoning?
- Can Socratic questioning replace external evidence verification in multi-agent systems?
- How does RL refine reasoning paths without simply adding model capability?
- What alternatives exist when required knowledge is absent from training?
- Can training on reasoning traces teach actual self-correction or only confident first answers?
- Can judges trained on both verifiable and non-verifiable tasks transfer across domains?
- Why does search-augmented generation still not solve the verification problem?
- Can multiple verification approaches together overcome the self-improvement ceiling?
- Can reasoning catalyst data serve as a stable foundation for test-time training?
- What alternatives to RLHF better preserve truth-seeking in AI outputs?
- How can one training example improve reasoning across thousands of unseen problems?
- What role do verifiers play in stabilizing extended reasoning at test time?
- Can one training example activate mathematical reasoning in RL-trained models?
- Does the verification gap widen exactly where judgment replaces checkability?
- Can verification loops and decomposition fix judgment failures?
- Does RL amplify existing reasoning or create genuinely new computational strategies?
- Can learned verifiers over token similarity replace dense compositional training?
- How can reasoning quality be verified before integrating new information into a reasoning graph?
- Why does moving verifier synthesis to the LLM extend verification beyond math and code domains?
- Can verifier output replace ground-truth answers as the asymmetric information source?
- How do verifier-free and adversarial approaches compare in extending reasoning RL?
- What makes answer equivalence sufficient to discard a reasoning path?
- Can mathematical reasoning improvements transfer across problem subdomains?
- Can verifier-based objectives preserve reasoning transparency alongside correctness?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- What does RL post-training actually teach reasoning systems?
- What reasoning tasks are actually checkable through process verification?
- Can verifier-free RL work without manual preference labels or task-specific training?
- How can verifier-free reinforcement learning handle reasoning without task-specific checks?
- How do verifier-free RL patterns differ from traditional RLHF approaches?
- Can approximate or noisy reference answers work for RL-based reasoning training?
- Can RL create new reasoning primitives that pretraining never established?
- How do extrapolative and contextual generalization measure RL reasoning gains?
- Can small demonstration sets unlock general reasoning without large question data?
Related concepts in this collection (9)
- Can language models replace reward models with internal signals?
  Recent RL research shows three independent patterns (self-judgment, belief-shift, and rich feedback) that each eliminate a component of the traditional RLHF stack. Are these patterns converging on a fundamentally different architecture for training without external verifiers?
  meta-claim: VeriFree is one of five substitutable verifier-free patterns; design space now legible
- Can reward models learn by comparing policies instead of judging them?
  What if reward models worked as policy discriminators, measuring distance to a target rather than encoding absolute preferences? Could this eliminate the need for manual preference labels and scale across domains?
  POLAR's relational reward framing is a different member of the verifier-free family; VeriFree relies on reference-answer likelihood, POLAR on target-policy similarity
- Can models learn to judge themselves without external rewards?
  Can a language model train itself by alternating between generating responses and evaluating them using only internal consistency signals? This explores whether evaluation itself can become a learnable skill without external supervision.
  SERL's pairwise self-judgment is another verifier-free pattern; works where no reference answer is available
- Can an agent's own beliefs guide credit assignment without critics?
  Explore whether an agent's shifting probability estimates toward the correct answer could serve as a self-contained reward signal for long-horizon reinforcement learning, eliminating the need for separate process reward models or external verifiers.
  ΔBelief-RL uses the same reference answer that VeriFree uses but extracts a denser per-turn signal from it (belief shift, not just final likelihood)
- Can adversarial critics replace task-specific verifiers for reasoning?
  Explores whether an adversarial game between policy and critic can substitute for explicit verifiers in RL-based reasoning training. Matters because many domains lack the task-specific validators that make current reasoning RL possible.
  alternative verifier-free approach via IRL; VeriFree uses reference-answer likelihood instead
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  VeriFree confirms reasoning is latent, just needs appropriate reward signal
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  VeriFree provides a minimal signal (reference-answer likelihood) that unlocks reasoning
- Why doesn't mathematical reasoning transfer to medicine?
  Can models trained to reason well about math apply those skills to medical domains through fine-tuning? This explores whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.
  VeriFree provides RL path for domain-specific reasoning where SFT fails
- Can model confidence alone replace external answer verification?
  Can LLMs use their own certainty signals instead of external verifiers to improve reasoning? This matters for scaling beyond domains where correct answers can be automatically checked.
  RLPR and INTUITOR extend the verifier-free progression further: VeriFree conditions on reference-answer likelihood, RLPR uses intrinsic token probabilities, INTUITOR uses pure self-certainty, with each step making weaker assumptions about the required external signal
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reinforcing General Reasoning without Verifiers
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- RLPR: Extrapolating RLVR to General Domains without Verifiers
- Local Coherence or Global Validity? Investigating RLVR Traces in Math Domains
- Escaping the Verifier: Learning to Reason via Demonstrations
- Premise-Augmented Reasoning Chains Improve Error Identification in Math reasoning with LLMs
- The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
Original note title
verifier-free rl extends reasoning reinforcement to general domains by conditioning on reference answer likelihood rather than verifying generated answers