Can models that reason well also grade reasoning well?
Do the same capabilities that let language models produce valid reasoning also let them spot flawed reasoning? Testing this assumption reveals a surprising gap between production and evaluation skills.
The intuitive picture treats "reasoning" as one faculty that scales together: a model that can produce a valid proof should be able to spot an invalid one. This paper breaks that assumption. Using the Valid-Answer-Invalid-Reasoning (VAIR) dataset — math problems whose solutions reach the correct answer through a trivially flawed step — it isolates evaluation from production. Humans are only ~6% worse at grading than solving; frontier large reasoning models score as low as 48% at grading despite near-perfect solving. Evaluation and production are not the same muscle, and current training only exercises one.
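The gap described above is a difference between two accuracies measured on the same problems. A minimal sketch of that bookkeeping follows, assuming every graded solution is a VAIR-style item, so the correct grading verdict is always "invalid". The `Item` record, the function name, and the four toy records are hypothetical illustrations, not data or code from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Item:
    """One problem, seen from both sides (hypothetical record layout)."""
    solved_correctly: bool  # model reached the right final answer when solving
    flagged_invalid: bool   # model judged the flawed solution invalid when grading


def production_evaluation_gap(items: List[Item]) -> Tuple[float, float, float]:
    """Return (solving accuracy, grading accuracy, gap between them)."""
    if not items:
        raise ValueError("need at least one item")
    n = len(items)
    solve_acc = sum(i.solved_correctly for i in items) / n
    # Assumption: every solution shown to the grader is flawed,
    # so flagging it as invalid is the only correct verdict.
    grade_acc = sum(i.flagged_invalid for i in items) / n
    return solve_acc, grade_acc, solve_acc - grade_acc


# Toy data invented for illustration only.
toy_items = [
    Item(solved_correctly=True, flagged_invalid=True),
    Item(solved_correctly=True, flagged_invalid=False),
    Item(solved_correctly=True, flagged_invalid=True),
    Item(solved_correctly=True, flagged_invalid=False),
]

solve_acc, grade_acc, gap = production_evaluation_gap(toy_items)
print(f"solve={solve_acc:.2f} grade={grade_acc:.2f} gap={gap:.2f}")
```

Keeping the two accuracies separate, rather than reporting one blended "reasoning" score, is what lets the gap show up at all.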
The mechanism is an answer-confirmation bias visible at two levels. Behaviorally, chain-of-thought (CoT) analysis shows that models follow a produce-then-check pattern, confirming that the right answer appears rather than auditing each step, and that they fabricate rationalizations even after noticing an anomaly. Representationally, linear probes find that activations encode some notion of valid reasoning but fail to robustly represent VAIR solutions as invalid. The authors attribute this to outcome-focused objectives: if reward lands only on the final answer, the model is never pressured to penalize a correct answer reached badly.
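The point about outcome-focused objectives can be made concrete with a toy comparison. The sketch below is an illustration under stated assumptions, not the paper's training setup: `Solution`, `outcome_reward`, and `process_aware_reward` are hypothetical names, and the per-step validity labels are assumed to be given.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Solution:
    final_answer: str
    step_valid: List[bool]  # hypothetical per-step validity labels


def outcome_reward(sol: Solution, gold: str) -> float:
    """Reward that looks only at the final answer."""
    return 1.0 if sol.final_answer == gold else 0.0


def process_aware_reward(sol: Solution, gold: str) -> float:
    """Reward that also requires every step to be valid (assumed labels)."""
    answer_ok = sol.final_answer == gold
    return 1.0 if answer_ok and all(sol.step_valid) else 0.0


gold = "42"
valid_solution = Solution(final_answer="42", step_valid=[True, True, True])
# VAIR-style case: correct answer reached through one flawed step.
vair_solution = Solution(final_answer="42", step_valid=[True, False, True])

for name, sol in [("valid", valid_solution), ("vair", vair_solution)]:
    print(name, outcome_reward(sol, gold), process_aware_reward(sol, gold))
```

Under the outcome-only reward the two solutions are indistinguishable, so nothing in the training signal separates sound reasoning from a flawed path that happens to land on the right answer.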
This sharpens the vault's jaggedness thread. The note "Why do reasoning models fail at exception-based rule inference?" already gives evidence that adding a reasoning process can degrade a capability; the production-evaluation gap names a second axis along which the same model is strong and weak at once. It also complicates the verifier-as-savior story: "Can reward models benefit from reasoning before scoring?" assumes that a reasoning model makes a good grader, but a grader carrying answer-confirmation bias may rationalize rather than verify. The paper warns that this bias contaminates process reward model (PRM) training too.
The strongest counterargument: VAIR's flaws are deliberately trivial-yet-answer-preserving, an adversarial slice that may not reflect normal evaluation loads. But that is the point — the easy cases are exactly where a genuine evaluator should never fail.
Related concepts in this collection 3
- Why do reasoning models fail at exception-based rule inference?
  Explores why chain-of-thought models systematically underperform on tasks requiring inductive rule inference from exceptions in game-based settings, despite excelling at normal rule patterns.
  convergent-with: another axis of jaggedness where the reasoning process itself introduces failure
- Can reward models benefit from reasoning before scoring?
  Does allowing evaluator models to generate reasoning traces before producing reward scores improve alignment and enable adaptive compute allocation? Three independent research teams converged on this insight simultaneously.
  contradicts the optimism that reasoning-traced graders verify well; answer-confirmation bias makes them rationalize
- Does gradually tightening token budgets beat fixed budget training?
  Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.
  grounds: both trace the failure to outcome-focused RL objectives
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
- Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
- Large Language Model Reasoning Failures
- Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs
- VibeThinker-3B: Exploring the Frontier of Verifiable Reasoning in Small Language Models
Original note title
producing reasoning and evaluating reasoning are separable capabilities — outcome-reward training buys the first while starving the second