SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation

Can reasoning improvement work without answer verification?

Explores whether RL-based reasoning training can extend beyond math and code to general domains like chemistry and law by replacing answer verification with a simpler signal based on reference answer likelihood.

Synthesis note · 2026-02-22 · sourced from Reward Models
How should we allocate compute budget at inference time? How do you build domain expertise into general AI models?

DeepSeek-R1-Zero-style RL training has produced remarkable gains in math and code — but only because those domains have rule-based verifiers (answer checking, test cases). Extending this paradigm to chemistry, healthcare, law, biology, and economics has been blocked by the answer verification requirement. Model-based verifiers (using an LLM to check answers) are the standard workaround, but they introduce reward hacking vulnerability, depend on a strong verifier LLM, and add significant compute overhead from maintaining the verifier in memory.

VeriFree (2025) offers a structurally different solution: skip verification entirely. Given a question, the model generates only the reasoning trace, which is then concatenated with the reference answer from the dataset. The likelihood of the reference answer conditioned on the question and generated reasoning trace serves dual purposes: (1) reward signal for policy gradients on the reasoning trace, and (2) weighting term for supervised training of the reference answer.
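
One way to see why a single quantity can play both roles is to write the objective as the expected likelihood of the reference answer and differentiate it. The notation below is illustrative and not taken from the source: x is the question, z a sampled reasoning trace, y* the reference answer, and π_θ the policy. This is a sketch of the idea under those assumptions, not necessarily the exact estimator VeriFree uses.

```latex
\begin{align}
J(\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
  \left[ \pi_\theta(y^* \mid x, z) \right] \\
\nabla_\theta J(\theta) &= \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}
  \Big[ \pi_\theta(y^* \mid x, z)
  \big( \underbrace{\nabla_\theta \log \pi_\theta(z \mid x)}_{\text{policy gradient on the trace}}
  + \underbrace{\nabla_\theta \log \pi_\theta(y^* \mid x, z)}_{\text{supervised term on the reference answer}} \big) \Big]
\end{align}
```

The factor π_θ(y* | x, z) multiplies both terms: on the first it acts as the reward for the sampled trace, on the second as the weight on the supervised log-likelihood of the reference answer.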

The intuition: a good reasoning trace will make the reference answer more likely. If the model reasons correctly about why a molecule has certain properties, the probability of generating the correct molecular description increases. The reasoning trace's quality is measured by how well it "leads to" the known answer — without ever needing to verify whether the model's own generated answer matches.
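
The intuition can be made concrete with a small numerical sketch. Everything here is hypothetical: the per-token log-probabilities are made-up values standing in for what a model might assign to the same three-token reference answer after two different reasoning traces, and the mean-subtraction baseline is an assumption added for illustration, not something the note attributes to VeriFree.

```python
import math
from typing import List, Sequence


def reference_likelihood(answer_token_logprobs: Sequence[float]) -> float:
    """P(reference answer | question, trace), from per-token log-probs."""
    return math.exp(sum(answer_token_logprobs))


def centered_rewards(rewards: Sequence[float]) -> List[float]:
    """Subtract the group mean as a simple baseline (illustrative assumption)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hypothetical log-probs of the same 3-token reference answer,
# conditioned on two different sampled reasoning traces.
good_trace_logprobs = [-0.1, -0.2, -0.1]  # trace that "leads to" the answer
poor_trace_logprobs = [-1.5, -2.0, -1.0]  # trace that does not

rewards = [
    reference_likelihood(good_trace_logprobs),
    reference_likelihood(poor_trace_logprobs),
]
advantages = centered_rewards(rewards)

for name, r, a in zip(["good", "poor"], rewards, advantages):
    print(f"{name} trace: reward={r:.4f} advantage={a:+.4f}")
```

The trace that makes the reference answer more likely receives the larger reward and a positive advantage. At no point is an answer generated by the model compared against the reference.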

This connects to two existing notes. Can adversarial critics replace task-specific verifiers for reasoning? (RARO) uses adversarial IRL to learn rewards from demonstrations. VeriFree takes a simpler path: no learned reward model at all, just the reference answer's conditional probability. And because, as Does RL teach reasoning or just when to use it? argues, the reasoning capability is already latent, VeriFree's role is to supply the reward signal that activates it in domains where rule-based verification was unavailable.

The practical consequence: R1-Zero-style training is no longer limited to math and code. Any domain with reference answers (even approximate or noisy ones) can now use RL for reasoning improvement.

Reweave 2026-05-18: VeriFree is one of six substitutable verifier-free patterns. What looked like an alternative to RARO when this note was written has since resolved into a family of substitutable mechanisms. Can language models replace reward models with internal signals? names the convergence: each pattern replaces a different RLHF/RLVR component without touching the others. VeriFree replaces the verifier with the reference-answer-likelihood signal, which makes it a fifth member of this family alongside SERL (pairwise self-judgment), ΔBelief-RL (internal belief shift), SDPO (rich-feedback self-distillation), and POLAR (similarity-to-target-policy as relational reward). RARO is a sixth via adversarial IRL.

The structural claim that emerges: the reward-signal source is substitutable in much the way RL algorithm choice turned out to be substitutable. Six different verifier-free reward sources converge on similar capability gains because, as Does the choice of RL algorithm actually matter for reasoning? argues at the algorithm level, the binding constraint is the pretrained prior, not the specific source of reward signal. VeriFree's contribution is not that its specific mechanism is uniquely correct but that it confirmed verifier-free reward signals can match verifier-based ones in domains where the verifier was the bottleneck. The fact that five other mechanisms now achieve the same result is consistent with the substitutability thesis, not a refutation of VeriFree's value.

A second consequence of the reweave: VeriFree's design choice — reference-answer-likelihood — is the most general-purpose member of the family because it requires only a reference answer (which most supervised datasets provide). SERL needs pairwise comparability of self-generated responses. ΔBelief-RL needs ground-truth final outcomes during training. SDPO needs rich tokenized environment feedback. POLAR needs a target policy as reference. RARO needs expert demonstrations. VeriFree's requirements are the lightest. This positions it as a default fallback in the verifier-free design space — not necessarily the best, but the most broadly applicable.
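
The comparison above can be restated as a small lookup. This is a sketch for illustration only: the pattern names come from the note, but the requirement labels (such as `reference_answer`) and the `applicable_patterns` helper are hypothetical names invented here, and each pattern is reduced to the single prerequisite the note mentions.

```python
from typing import AbstractSet, Dict, FrozenSet, List

# One prerequisite per pattern, as summarized in the note.
# The string labels are illustrative, not an established taxonomy.
REQUIREMENTS: Dict[str, FrozenSet[str]] = {
    "VeriFree": frozenset({"reference_answer"}),
    "SERL": frozenset({"pairwise_comparable_responses"}),
    "ΔBelief-RL": frozenset({"ground_truth_outcomes"}),
    "SDPO": frozenset({"rich_environment_feedback"}),
    "POLAR": frozenset({"target_policy"}),
    "RARO": frozenset({"expert_demonstrations"}),
}


def applicable_patterns(available: AbstractSet[str]) -> List[str]:
    """Return the patterns whose prerequisites are all available."""
    return [name for name, needs in REQUIREMENTS.items() if needs <= available]


# A plain supervised dataset typically offers only reference answers.
supervised_only = applicable_patterns({"reference_answer"})
print(supervised_only)
```

With nothing but reference answers available, only VeriFree qualifies, which is the sense in which the note calls it a default fallback.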

Original note title

verifier-free rl extends reasoning reinforcement to general domains by conditioning on reference answer likelihood rather than verifying generated answers