SYNTHESIS NOTE

Can models that reason well also grade reasoning well?

Do the same capabilities that let language models produce valid reasoning also let them spot flawed reasoning? Testing this assumption reveals a surprising gap between production and evaluation skills.

Synthesis note · 2026-06-27 · sourced from MechInterp

The intuitive picture treats "reasoning" as one faculty that scales together: a model that can produce a valid proof should be able to spot an invalid one. This paper breaks that assumption. Using the Valid-Answer-Invalid-Reasoning (VAIR) dataset — math problems whose solutions reach the correct answer through a trivially flawed step — it isolates evaluation from production. Humans are only ~6% worse at grading than solving; frontier large reasoning models score as low as 48% at grading despite near-perfect solving. Evaluation and production are not the same muscle, and current training only exercises one.
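As a rough illustration of how the gap between solving and grading could be tallied, here is a minimal sketch. The `GradingRecord` schema, the function name, and the toy data are all assumptions made for this note, not the paper's evaluation code or results. It relies on the one property stated above: every VAIR solution is invalid, so a grading verdict is correct only when the model rejects the solution.

```python
from dataclasses import dataclass


@dataclass
class GradingRecord:
    """One model's behaviour on one VAIR-style item (hypothetical schema)."""
    solved_correctly: bool  # did the model reach the right answer when solving?
    judged_valid: bool      # did the model call the flawed solution valid?


def production_evaluation_gap(records):
    """Return (solve_acc, grade_acc, gap) over a list of records.

    Every VAIR-style solution is invalid by construction, so a grading
    verdict counts as correct only when judged_valid is False.
    """
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    solve_acc = sum(r.solved_correctly for r in records) / n
    grade_acc = sum(not r.judged_valid for r in records) / n
    return solve_acc, grade_acc, solve_acc - grade_acc


# Toy data invented for illustration only (not numbers from the paper).
toy = [
    GradingRecord(solved_correctly=True, judged_valid=True),
    GradingRecord(solved_correctly=True, judged_valid=False),
    GradingRecord(solved_correctly=True, judged_valid=True),
    GradingRecord(solved_correctly=True, judged_valid=False),
]

solve_acc, grade_acc, gap = production_evaluation_gap(toy)
print(f"solve={solve_acc:.2f} grade={grade_acc:.2f} gap={gap:.2f}")
```

Reporting the two accuracies side by side, rather than a single score, is what keeps production and evaluation from being averaged into one "reasoning" number.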

The mechanism is an answer-confirmation bias visible at two levels. Behaviorally, chain-of-thought (CoT) analysis shows that models produce their own answer and then check the given solution against it rather than auditing each step, and that they fabricate rationalizations even after noticing an anomaly. Representationally, linear probes find that activations encode some notion of valid reasoning but fail to robustly represent VAIR solutions as invalid. The authors attribute this to outcome-focused objectives: if reward lands only on the final answer, the model is never pressured to penalize a correct answer reached badly.
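The attribution to outcome-focused objectives can be made concrete with a toy contrast between two reward functions. This is a sketch under stated assumptions, not the paper's training setup: the `Step` and `Solution` types, both reward functions, and the per-step `valid` labels are hypothetical, and the example problem (the classic "cancel the 6s" fraction) is chosen here only because it reaches a correct answer through an invalid step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    text: str
    valid: bool  # ground-truth step label (assumed to be available)


@dataclass
class Solution:
    steps: List[Step]
    final_answer: str


def outcome_reward(sol: Solution, reference: str) -> float:
    """Reward only the final answer; the steps are never inspected."""
    return 1.0 if sol.final_answer == reference else 0.0


def process_reward(sol: Solution, reference: str) -> float:
    """Reward the answer only if every step is also labelled valid."""
    all_valid = all(s.valid for s in sol.steps)
    return 1.0 if all_valid and sol.final_answer == reference else 0.0


# VAIR-style toy: the digit cancellation is illegitimate, yet 16/64 = 1/4 is true.
vair_like = Solution(
    steps=[
        Step("16/64: cancel the 6 in the numerator and denominator", valid=False),
        Step("so 16/64 = 1/4", valid=True),
    ],
    final_answer="1/4",
)

print(outcome_reward(vair_like, "1/4"))  # 1.0: the flawed step costs nothing
print(process_reward(vair_like, "1/4"))  # 0.0: the flawed step is penalized
```

Under the outcome-only signal the flawed solution is indistinguishable from a sound one, which is the pressure gap the authors point to.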

This sharpens the vault's jaggedness thread. The note "Why do reasoning models fail at exception-based rule inference?" already gives evidence that adding a reasoning process can degrade a capability; the production-evaluation gap names a second axis where the same model is strong and weak at once. It also complicates the verifier-as-savior story: the note "Can reward models benefit from reasoning before scoring?" assumes a reasoning model makes a good grader, but a grader carrying answer-confirmation bias may rationalize rather than verify, and the paper warns that this contaminates process reward model (PRM) training too.

The strongest counterargument: VAIR's flaws are deliberately trivial-yet-answer-preserving, an adversarial slice that may not reflect normal evaluation loads. But that is the point — the easy cases are exactly where a genuine evaluator should never fail.

Original note title

producing reasoning and evaluating reasoning are separable capabilities — outcome-reward training buys the first while starving the second