Should reasoning benchmarks score final answers or reasoning traces?

Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?

Synthesis note · 2026-05-02 · sourced from Reasoning Methods CoT ToT

LR²Bench scores Exact Match on the final solution against deterministic CSP (constraint satisfaction problem) ground truth. It does not score the trace. This methodological choice is what produces the striking 20-23.6% result, and it is the choice most other reasoning benchmarks have quietly avoided. Trace-based evaluation (does the reasoning look right, are the reflective phrases present, does the chain have the expected structure) would have inflated the result by counting plausible-looking reflection as evidence of reflection. CSPs do not allow that inflation, because each constraint either holds or it does not.
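The contrast between the two scoring styles can be sketched in a few lines. This is a toy illustration, not LR²Bench's actual scoring code: the `Response` type, the `normalize` step, the phrase list, and the `trace_score` function are all hypothetical names introduced here, and the whitespace and case normalization is an assumption.

```python
from dataclasses import dataclass

# Hypothetical phrase list for the toy trace scorer; not taken from any benchmark.
REFLECTIVE_PHRASES = ("wait", "let me re-check", "on second thought")


@dataclass
class Response:
    trace: str   # free-form reasoning text
    answer: str  # final solution as a string


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an assumed normalization)."""
    return " ".join(text.split()).lower()


def exact_match(response: Response, gold: str) -> int:
    """Return 1 only if the final answer equals the ground truth; the trace is never read."""
    return int(normalize(response.answer) == normalize(gold))


def trace_score(response: Response) -> float:
    """Toy trace-based score: fraction of reflective phrases that appear in the trace."""
    trace = response.trace.lower()
    hits = sum(phrase in trace for phrase in REFLECTIVE_PHRASES)
    return hits / len(REFLECTIVE_PHRASES)


# A response that sounds reflective but ends on the wrong solution.
wrong_but_reflective = Response(
    trace="Wait, let me re-check the second constraint. On second thought, it holds.",
    answer="3 1 2",
)
gold = "1 2 3"
```

Under these assumptions, `trace_score(wrong_but_reflective)` gives full marks while `exact_match(wrong_but_reflective, gold)` gives zero, which is the inflation the paragraph above describes.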

The lesson generalizes. The note "Do reasoning traces actually cause correct answers?" argues the principle: derivational traces are stylistic mimicry of reasoning, not verified reasoning. The note "Does RLVR actually improve mathematical reasoning or just coherence?" argues the empirical version: training improves trace coherence without improving trace validity. LR²Bench operationalizes the methodological response: measure the outcome, not the trace, on tasks where the outcome is independently verifiable.

The harder corollary: many existing reasoning benchmarks are partly trace evaluation in disguise. Math benchmarks with permissive partial-credit grading, multi-step reasoning tasks where graders can "interpretation-credit" intermediate steps, and dialogue tasks where helpfulness is judged on tone all give credit for reflective appearance even when outcomes are wrong or absent. CSPs are valuable not because they are common in real applications but because they are epistemically clean: they isolate whether the model can do the thing, free from rhetorical credit.

For benchmark design more broadly, the LR²Bench template is: pick tasks with deterministic verifiers; measure the final outcome; do not score the trace. Apply that template to a domain and the reasoning theater collapses into whatever reasoning is actually happening. Twenty percent on CSPs is the floor after the theater is removed. Benchmarks that produce higher numbers should explain how their design avoids re-introducing trace credit, and most cannot.
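As a minimal sketch of what a "deterministic verifier" can look like, the function below checks a proposed solution to a small Latin-square puzzle. The puzzle type is chosen here purely for illustration and is not claimed to be an LR²Bench task; `verify_latin_square` and the `givens` format are hypothetical names and conventions.

```python
from typing import Dict, List, Tuple


def verify_latin_square(
    grid: List[List[int]],
    givens: Dict[Tuple[int, int], int],
) -> bool:
    """Return True only if every constraint holds; no partial credit.

    grid   -- proposed n x n solution using the values 1..n
    givens -- pre-filled cells as {(row, col): value}, assumed to be in range
    """
    n = len(grid)
    if n == 0 or any(len(row) != n for row in grid):
        return False
    expected = set(range(1, n + 1))
    # Each row must contain 1..n exactly once.
    if any(set(row) != expected for row in grid):
        return False
    # Each column must contain 1..n exactly once.
    for col in range(n):
        if {grid[row][col] for row in range(n)} != expected:
            return False
    # The solution must agree with every pre-filled cell.
    return all(grid[r][c] == value for (r, c), value in givens.items())


givens = {(0, 0): 1}
valid = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
column_clash = [[1, 2, 3], [2, 3, 1], [3, 2, 1]]
ignores_given = [[2, 3, 1], [3, 1, 2], [1, 2, 3]]
```

The verifier takes only the final grid as input, so whatever reasoning text accompanied the solution cannot influence the result. A benchmark built this way reports the fraction of instances for which the verifier returns `True`.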

Inquiring lines that read this note 23

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do corrupted reasoning traces serve as effective supervision signals?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Why do correct reasoning traces appear shorter than incorrect ones?

Why do benchmark improvements fail to reflect actual reasoning quality?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why does self-revision increase model confidence while degrading accuracy?

Why do final answers contradict what the thinking draft explicitly concluded?

Can single-axis benchmarks accurately predict agent deployment success?

What makes a trajectory score interpretable across different interactive benchmarks?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How much reasoning work happens in steps that don't affect the final answer?

What properties determine whether reward signals teach genuine reasoning?

Do reasoning traces actually make better reward models for grading answers?

Related concepts in this collection 2

Do reasoning traces actually cause correct answers? Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
principle: traces are mimicry, not verification
Does RLVR actually improve mathematical reasoning or just coherence? RLVR post-training makes reasoning traces locally more consistent, but does this structural improvement translate to valid mathematical proofs? We investigate whether trace coherence is sufficient for correctness.
empirical: training improves coherence not validity

Original note title

reflection benchmarks should be solution-verifiable not trace-verifiable — Exact Match on the answer cuts through reasoning theater
