An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

Paper · arXiv 2606.01462 · Published May 31, 2026

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid.
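As a rough illustration of how a production-evaluation gap of this kind could be computed, the sketch below compares a model's accuracy at solving problems with its accuracy at flagging flawed solutions. It is a minimal sketch and not the authors' evaluation code: the `Record` type, its field names, and the toy records are hypothetical, and it assumes that every VAIR solution contains a reasoning flaw, so the correct grading verdict is always "invalid".

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Record:
    """One problem, scored two ways (hypothetical schema, not from the paper)."""
    solved_correctly: bool    # model produced the valid final answer itself
    flagged_as_invalid: bool  # model judged the flawed solution to be invalid


def accuracy(flags: Iterable[bool]) -> float:
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def production_evaluation_gap(records: List[Record]) -> Tuple[float, float, float]:
    """Return (production accuracy, evaluation accuracy, gap).

    Assumption: every solution shown to the model is flawed, so a correct
    evaluation is one that flags the solution as invalid.
    """
    production = accuracy(r.solved_correctly for r in records)
    evaluation = accuracy(r.flagged_as_invalid for r in records)
    return production, evaluation, production - evaluation


# Toy records for illustration only; these are not results from the paper.
records = [
    Record(solved_correctly=True, flagged_as_invalid=True),
    Record(solved_correctly=True, flagged_as_invalid=False),
    Record(solved_correctly=True, flagged_as_invalid=False),
    Record(solved_correctly=True, flagged_as_invalid=True),
]

production, evaluation, gap = production_evaluation_gap(records)
print(f"production={production:.2f} evaluation={evaluation:.2f} gap={gap:.2f}")
```

A positive gap means the model solves problems more reliably than it grades flawed solutions to them, which is the pattern the abstract reports for LRMs.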

Introduction. Recent advances in AI have led to the development of large reasoning models (LRMs): large language models (LLMs) that are trained to “reason” artificially by generating long chains of tokens before committing to a final answer [36, 13, 15]. LRMs have demonstrated impressive outcomes in domains such as software engineering [19, 55], research mathematics [14, 1], and abstract visual reasoning [6, 7]. Yet numerous studies have also shown that these capabilities are “jagged” in nature [9, 34], failing to generalize reliably both beyond [20, 27, 56] and within their core domains of mathematics and coding [58, 33, 17, 38]. In this paper, we investigate a particularly striking way in which LRM reasoning is jagged: even though LRMs are highly capable at producing reasoning, they can fail drastically at evaluating reasoning, exhibiting a production-evaluation gap. We demonstrate this gap through the Valid-Answer-Invalid-Reasoning (VAIR) dataset, a suite of math problems and solutions that we perturb to introduce trivial reasoning flaws while preserving the original valid answer.

Discussion / Conclusion. In this paper, we investigate the production-evaluation gap with a dataset intended to disentangle reasoning evaluation from the confound of reasoning production. Our analyses reveal that while human reasoners robustly evaluate flawed reasoning even in the presence of valid answers, LRMs suffer from answer confirmation biases at both the behavioral and representational levels. We interpret this as a symptom of the outcome-focused objectives used in LRM training, which incentivize strong reasoning production capabilities at the expense of step-by-step evaluation. In Appendix A.4, we discuss how this answer confirmation bias may also affect the training of PRMs. Implications for Artificial Reasoning. Our findings highlight that “reasoning” is not a monolithic capability at which models are uniformly improving, even within a single domain like mathematics.
