SYNTHESIS NOTE

Can models that reason well also grade reasoning well?

Do the same capabilities that let language models produce valid reasoning also let them spot flawed reasoning? Testing this assumption reveals a surprising gap between production and evaluation skills.

Synthesis note · 2026-06-27 · sourced from MechInterp

The intuitive picture treats "reasoning" as one faculty that scales together: a model that can produce a valid proof should be able to spot an invalid one. This paper breaks that assumption. Using the Valid-Answer-Invalid-Reasoning (VAIR) dataset — math problems whose solutions reach the correct answer through a trivially flawed step — it isolates evaluation from production. Humans are only ~6% worse at grading than solving; frontier large reasoning models score as low as 48% at grading despite near-perfect solving. Evaluation and production are not the same muscle, and current training only exercises one.
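To make the production-evaluation gap concrete, here is a minimal sketch of how the two scores could be computed side by side. The `Trial` record, its field names, the helper functions, and the toy data are all illustrative assumptions, not the paper's code or results. Because every VAIR solution is invalid by construction, grading accuracy here is simply the fraction of solutions the grader rejects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trial:
    # Hypothetical record for one problem; field names are assumptions.
    solved_correctly: bool  # production: did the model solve the problem itself?
    judged_valid: bool      # evaluation: did it accept the flawed VAIR solution?


def solving_accuracy(trials: List[Trial]) -> float:
    """Fraction of problems the model solved correctly."""
    return sum(t.solved_correctly for t in trials) / len(trials)


def grading_accuracy(trials: List[Trial]) -> float:
    """Fraction of VAIR solutions correctly rejected.

    Every VAIR solution is invalid, so the right verdict is always "invalid".
    """
    return sum(not t.judged_valid for t in trials) / len(trials)


def production_evaluation_gap(trials: List[Trial]) -> float:
    """Solving accuracy minus grading accuracy."""
    return solving_accuracy(trials) - grading_accuracy(trials)


# Toy data, invented for illustration only.
toy_trials = [
    Trial(solved_correctly=True, judged_valid=True),
    Trial(solved_correctly=True, judged_valid=False),
    Trial(solved_correctly=True, judged_valid=True),
    Trial(solved_correctly=True, judged_valid=False),
]

gap = production_evaluation_gap(toy_trials)
```

On this toy data the model solves everything but rejects only half of the flawed solutions, so the gap is large even though solving looks perfect.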

The mechanism is an answer-confirmation bias visible at two levels. Behaviorally, chain-of-thought (CoT) analysis shows that models produce their own solution and then check for the right answer rather than auditing each step, and that they fabricate rationalizations even after noticing an anomaly. Representationally, linear probes find that activations encode some notion of valid reasoning but fail to robustly represent VAIR solutions as invalid. The authors attribute this to outcome-focused objectives: if reward lands only on the final answer, the model is never pressured to penalize a correct answer reached badly.
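The attribution to outcome-focused objectives can be illustrated with a toy comparison of two reward rules. This is a sketch under stated assumptions: `Solution`, `outcome_reward`, and `process_reward` are hypothetical names, and the per-step validity labels are assumed to be given, which in practice is the hard part. It is not the training objective of any particular model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    # Assumed ground-truth validity of each reasoning step (hypothetical labels).
    step_valid: List[bool]
    final_answer: str


def outcome_reward(sol: Solution, reference_answer: str) -> float:
    """Reward depends only on whether the final answer matches."""
    return 1.0 if sol.final_answer == reference_answer else 0.0


def process_reward(sol: Solution, reference_answer: str) -> float:
    """Reward requires a matching answer AND every step being valid."""
    answer_ok = sol.final_answer == reference_answer
    return 1.0 if answer_ok and all(sol.step_valid) else 0.0


# A fully valid solution and a VAIR-style one: same answer, one flawed step.
valid_solution = Solution(step_valid=[True, True, True], final_answer="42")
vair_style_solution = Solution(step_valid=[True, False, True], final_answer="42")

outcome_scores = (
    outcome_reward(valid_solution, "42"),
    outcome_reward(vair_style_solution, "42"),
)
process_scores = (
    process_reward(valid_solution, "42"),
    process_reward(vair_style_solution, "42"),
)
```

Under the outcome-only rule the two solutions are indistinguishable, so nothing in the signal separates valid from flawed reasoning; only the step-aware rule tells them apart.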

This sharpens the vault's jaggedness thread. The note "Why do reasoning models fail at exception-based rule inference?" already gives evidence that adding a reasoning process can degrade a capability; the production-evaluation gap names a second axis where the same model is strong and weak at once. It also complicates the verifier-as-savior story: "Can reward models benefit from reasoning before scoring?" assumes a reasoning model makes a good grader, but a grader carrying answer-confirmation bias may rationalize rather than verify. The paper warns that this contaminates process reward model (PRM) training too.

The strongest counterargument: VAIR's flaws are deliberately trivial-yet-answer-preserving, an adversarial slice that may not reflect normal evaluation loads. But that is the point — the easy cases are exactly where a genuine evaluator should never fail.

Inquiring lines that use this note as a source 1

This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.

Why do invalid reasoning prompts work as well as valid ones?

Related concepts in this collection 3

Why do reasoning models fail at exception-based rule inference? Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.
convergent-with: another axis of jaggedness where the reasoning process itself introduces failure
Can reward models benefit from reasoning before scoring? Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.
contradicts: the optimism that reasoning-traced graders verify well; answer-confirmation bias makes them rationalize instead
Does gradually tightening token budgets beat fixed budget training? Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
grounds: both trace the failure to outcome-focused RL objectives

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

producing reasoning and evaluating reasoning are separable capabilities — outcome-reward training buys the first while starving the second
