Can outcome-focused objectives explain failures in reasoning evaluation?
This explores whether grading reasoning by its final answer — an outcome-focused objective — is itself the reason we keep misdiagnosing where and why reasoning models fail.
This reads the question as: does scoring reasoning on outcomes (did it get the right answer?) rather than process (did it reason soundly?) distort our picture of failure? The corpus strongly suggests yes: outcome-focused evaluation hides the actual failure surface and rewards the wrong things.
The sharpest evidence is direct: when you stop scoring final answers and start verifying intermediate steps, task success jumps from 32% to 87%, because most failures turn out to be process violations the answer grade never sees (Where do reasoning agents actually fail during long traces?). Generative judges that reason *about* each reasoning step likewise beat discriminative, classifier-style rewards, with far less training data; judging the trajectory carries information that judging the endpoint throws away (Can judges that reason about reasoning outperform classifier rewards?). Both point at the same gap: an outcome objective is blind to *how* the model got there.
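A toy sketch of that gap, assuming a made-up arithmetic trace format (the `Step` type, `outcome_score`, and `first_invalid_step` are hypothetical names for illustration, not taken from any of the cited notes). It shows a trace whose two errors cancel out: the outcome grade passes it, while a step-by-step check flags the first violation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    op: str       # "+" or "*"
    operand: int
    claimed: int  # value the trace claims to hold after this step


def apply_step(value: int, step: Step) -> int:
    """Recompute what the step should produce from the previous value."""
    return value + step.operand if step.op == "+" else value * step.operand


def outcome_score(trace: List[Step], expected: int) -> bool:
    """Outcome-focused grading: only the final claimed value is inspected."""
    return trace[-1].claimed == expected


def first_invalid_step(trace: List[Step], start: int) -> Optional[int]:
    """Process-focused grading: index of the first step whose claimed value
    does not follow from the previous claimed value, or None if all follow."""
    value = start
    for i, step in enumerate(trace):
        if apply_step(value, step) != step.claimed:
            return i
        value = step.claimed
    return None


# Task (hypothetical): start at 2, add 3, multiply by 4, add 1. Expected: 21.
sound = [Step("+", 3, 5), Step("*", 4, 20), Step("+", 1, 21)]
# Two wrong steps that happen to cancel: 2 + 3 is not 6, and 6 * 4 is not 20.
flawed = [Step("+", 3, 6), Step("*", 4, 20), Step("+", 1, 21)]

for name, trace in (("sound", sound), ("flawed", flawed)):
    print(name, outcome_score(trace, 21), first_invalid_step(trace, 2))
```

Both traces receive the same outcome grade; only the step check separates them. Real reasoning traces are free text, so an actual step verifier would be far more involved than this recomputation.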
The unsettling part is what that blindness lets through. Models trained on deliberately corrupted or logically invalid reasoning traces perform about as well as those trained on correct ones (Do reasoning traces need to be semantically correct?; Does logical validity actually drive chain-of-thought gains?). That suggests chain-of-thought is often imitated structure rather than inference, and outcome scoring cannot tell the two apart (Why does chain-of-thought reasoning fail in predictable ways?). So a model can score well while reasoning badly, and we'd never know. Outcome objectives don't just miss failures; they actively certify the wrong successes.
They also misattribute the failures they do catch. What look like 'reasoning collapses' are frequently execution limits, where the model knows the algorithm but can't run it step by step at scale in text (Are reasoning model collapses really failures of reasoning?), or structural disorganization, where viable solution paths get abandoned prematurely rather than never found (Why do reasoning models abandon promising solution paths?; Why do reasoning LLMs fail at deeper problem solving?). A final-answer score collapses all of these distinct mechanisms (execution, exploration, novelty) into one undifferentiated 'wrong,' which is exactly why benchmark cliffs get read as walls in reasoning ability when they're really instance-novelty boundaries (Do language models fail at reasoning due to complexity or novelty?).
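The collapse can be sketched with a few invented evaluation records. The `Result` type, the mechanism labels, and the counts below are all assumptions made up for illustration; in practice the labels would have to come from some process-level diagnosis, which is exactly what a final-answer score does not provide.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    correct: bool
    mechanism: Optional[str]  # hypothetical diagnosis; None when correct


# Invented records, not data from any benchmark.
results = [
    Result(True, None),
    Result(False, "execution"),    # knew the algorithm, could not run it
    Result(False, "execution"),
    Result(False, "exploration"),  # abandoned a viable path too early
    Result(False, "novelty"),      # unfamiliar instance
]


def outcome_view(items: List[Result]) -> Counter:
    """What a final-answer score reports: right versus wrong, nothing else."""
    return Counter("right" if r.correct else "wrong" for r in items)


def mechanism_view(items: List[Result]) -> Counter:
    """What a process-aware evaluation could report for the same failures."""
    return Counter(r.mechanism for r in items if not r.correct)


print(outcome_view(results))
print(mechanism_view(results))
```

The outcome view reports four undifferentiated failures; the mechanism view shows that they would call for three different remedies.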
There's a twist worth keeping: outcome-shaped *reward* isn't always the villain. RL that optimizes toward outcomes naturally pushes models toward shorter chains as they get more capable (Why does chain of thought accuracy eventually decline with length?) and can flip extended thinking from counterproductive self-doubt into productive analysis (Does extended thinking help or hurt model reasoning?), though it does not remove the overthinking pathology, where more thinking tokens drop accuracy from 87% to 70% (Does more thinking time always improve reasoning accuracy?). So the honest synthesis: outcome objectives are fine for *training pressure* but corrosive for *evaluation*, because the same metric that nudges behavior also conceals whether the behavior is reasoning at all.
Sources (12 notes)
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length can succeed if the model was trained on similar instances.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
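One of the source notes above says success probability drops exponentially with problem depth. A minimal worked example of what that implies, under a simplifying assumption that is not from the notes: every step succeeds independently with the same probability, so success at depth d is that probability raised to the power d. The function name and the 0.9 per-step figure are illustrative only.

```python
def success_probability(per_step: float, depth: int) -> float:
    """Chance that all `depth` steps succeed, assuming independent steps
    with an identical per-step success rate (a simplifying assumption)."""
    return per_step ** depth


for depth in (5, 20, 50):
    print(depth, round(success_probability(0.9, depth), 3))
# prints:
# 5 0.59
# 20 0.122
# 50 0.005
```

Under this assumption a 90% per-step rate still solves a 5-step problem more often than not, but almost never a 50-step one, which matches the note's picture of medium problems being solvable while deep ones become catastrophically harder.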