SYNTHESIS NOTE
Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling

Should reasoning benchmarks score final answers or reasoning traces?

Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?

Synthesis note · 2026-05-02 · sourced from Reasoning Methods CoT ToT
Do reasoning traces show how models actually think? Can we actually trust reasoning model outputs?

LR²Bench scores Exact Match on the final solution against deterministic ground truth for constraint satisfaction problems (CSPs). It does not score the trace. This is the methodological choice that produces the dramatic 20–23.6% number, and it is the choice most other reasoning benchmarks have been quietly avoiding. Trace-based evaluation (does the reasoning look right, are the reflective phrases present, does the chain have the expected structure) would have inflated the result by counting plausible-looking reflection as evidence of reflection. CSPs do not allow that inflation because each constraint either holds or it does not.
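The gap between the two scoring styles can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than LR²Bench's actual code: the function names, the phrase list, and the response format are hypothetical, and the trace metric is a deliberately naive stand-in for trace-based grading.

```python
# Hypothetical phrase list standing in for "reflective appearance".
REFLECTIVE_PHRASES = ("wait", "let me double-check", "on second thought")


def exact_match(predicted, gold):
    """Return 1 if the final solution equals the ground truth exactly, else 0."""
    return int(predicted == gold)


def trace_style_score(trace):
    """Naive trace-based metric (assumption): share of reflective phrases present."""
    text = trace.lower()
    hits = sum(phrase in text for phrase in REFLECTIVE_PHRASES)
    return hits / len(REFLECTIVE_PHRASES)


# Toy 2x2 puzzle: the gold grid is the only valid solution.
gold = [[1, 2], [2, 1]]

# A response that sounds reflective but ends on a wrong grid.
response = {
    "trace": "Wait, let me double-check row two. On second thought, keep it.",
    "solution": [[1, 2], [1, 2]],
}

em = exact_match(response["solution"], gold)   # outcome-based score
style = trace_style_score(response["trace"])   # appearance-based score
print(em, style)  # prints: 0 1.0
```

The response earns full marks on reflective appearance and zero on the outcome, which is the inflation that Exact Match scoring refuses to count.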

The lesson generalizes. The note "Do reasoning traces actually cause correct answers?" argues the principle: derivational traces are stylistic mimicry of reasoning, not verified reasoning. The note "Does RLVR actually improve mathematical reasoning or just coherence?" argues the empirical version: training improves trace coherence without improving trace validity. LR²Bench operationalizes the methodological response: measure the outcome, not the trace, on tasks where the outcome is independently verifiable.

The harder corollary: many existing reasoning benchmarks are partly trace evaluation in disguise. Math benchmarks where partial-credit grading is permissive, multi-step reasoning tasks where intermediate steps can be "interpretation-credited" by graders, and dialogue tasks where helpfulness is judged on tone all give credit for reflective appearance even when outcomes are wrong or absent. CSPs are valuable not because they are common in real applications but because they are epistemically clean: they isolate whether the model can do the thing, free from rhetorical credit.

For benchmark design more broadly, the LR²Bench template is: pick tasks with deterministic verifiers; measure the final outcome; do not score the trace. Apply that template to a domain and the reasoning theater collapses into whatever reasoning is actually happening. Twenty percent on CSPs is the floor after the theater is removed. Benchmarks that produce higher numbers should explain how their design avoids re-introducing trace credit, and most cannot.
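As a minimal sketch of that template, assume a toy CSP (a small Latin square with given cells) in place of the benchmark's actual tasks. The verifier, the response format, and all names below are hypothetical. Note also that this sketch checks constraints directly, which differs slightly from Exact Match against a single ground-truth solution; the two coincide when the puzzle has exactly one solution.

```python
from typing import List, Optional

Grid = List[List[int]]
Givens = List[List[Optional[int]]]


def verify_latin_square(grid: Grid, givens: Givens) -> bool:
    """Deterministic verifier: every row and column uses 1..n once, givens kept."""
    n = len(givens)
    if len(grid) != n or any(len(row) != n for row in grid):
        return False
    symbols = set(range(1, n + 1))
    rows_ok = all(set(row) == symbols for row in grid)
    cols_ok = all({grid[r][c] for r in range(n)} == symbols for c in range(n))
    givens_ok = all(
        givens[r][c] is None or givens[r][c] == grid[r][c]
        for r in range(n)
        for c in range(n)
    )
    return rows_ok and cols_ok and givens_ok


def outcome_accuracy(responses, givens: Givens) -> float:
    """Share of responses whose final solution verifies; the trace is never read."""
    passed = sum(verify_latin_square(r["solution"], givens) for r in responses)
    return passed / len(responses)


givens = [[1, None, None], [None, None, 1], [None, 1, None]]
responses = [
    {"trace": "Fill row by row.",
     "solution": [[1, 2, 3], [2, 3, 1], [3, 1, 2]]},
    {"trace": "Wait, let me re-check every column carefully. Looks right.",
     "solution": [[1, 2, 3], [3, 2, 1], [2, 1, 3]]},  # column 2 repeats a 2
]
print(outcome_accuracy(responses, givens))  # prints: 0.5
```

The second response reads as the more careful of the two, but the scorer never sees its trace, so the repeated symbol in the middle column alone decides the score.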

Original note title

reflection benchmarks should be solution-verifiable not trace-verifiable — Exact Match on the answer cuts through reasoning theater