SYNTHESIS NOTE

Where do reasoning agents actually fail during long traces?

Does verifying only final answers miss the real sources of failure in multi-step reasoning? This note explores whether intermediate process checks reveal errors that outcome-level scoring hides.

Synthesis note · 2026-05-28 · sourced from Test Time Compute

As reasoning models produce long traces of intermediate decisions and tool calls, the locus of reliability shifts. interwhen makes the framing explicit: verifying only the final answer misses errors that occur early in the trace, so the unit of verification should be the process — intermediate states, tool calls, and policy compliance — checked continuously as the trace unfolds. The paper's agentic results dramatize the gap: pass^4 on the Telecom τ²-bench domain rises from 32% to 87% once intermediate verification is added, because most failures are not wrong final answers but process violations that compound.
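The note does not define pass^4. The sketch below assumes it is the pass^k metric as used in the τ-bench family of benchmarks: the probability that all k independent trials of a task succeed, averaged over tasks. With n trials per task and c successes, a standard unbiased estimate is C(c, k) / C(n, k). The function name and the success counts are hypothetical and are not taken from the paper.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the mean over tasks of C(c, k) / C(n, k).

    successes_per_task: number of successful trials c for each task
    n_trials: trials run per task (n)
    k: number of trials that must ALL succeed
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    if any(not 0 <= c <= n_trials for c in successes_per_task):
        raise ValueError("each success count must be between 0 and n_trials")
    total = comb(n_trials, k)
    # comb(c, k) is 0 when c < k, so an inconsistent task contributes nothing.
    return sum(comb(c, k) for c in successes_per_task) / (total * len(successes_per_task))


# Hypothetical data: three tasks, four trials each.
successes = [4, 3, 0]
print(round(pass_hat_k(successes, n_trials=4, k=1), 3))  # 0.583
print(round(pass_hat_k(successes, n_trials=4, k=4), 3))  # 0.333
```

Under this reading, a task that succeeds three times out of four counts toward pass^1 but contributes nothing to pass^4. The metric therefore rewards consistency, which is what an occasional process violation breaks.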

This is a pattern, not a single result. Process-level supervision recurs across the literature as more informative than outcome-level supervision: process reward models score steps, structural-feature supervision derives signal from trajectory shape, and completeness scaffolds force explicit derivation. interwhen's distinctive contribution to the pattern is that it verifies policy compliance — whether the trace obeys a stated policy — not just logical correctness, which extends process verification beyond math and code into agentic domains where "correct" is defined by rules rather than ground-truth answers.

The pattern matters because it changes what "reliable" means for an agent. A model can produce the right final answer through a non-compliant or unsafe process, and outcome verification will pass it; process verification will not. This aligns with the vault's recurring finding that final-output signals are systematically misleading about what happened inside the model.

Counterpoint and limit: process verification only helps where the process is checkable. interwhen depends on synthesizable verifiers, and where no verifier exists (open-ended generation, subjective tasks) the reframe offers no leverage. The honest scope is "tasks with formal or policy-expressible correctness criteria," which is broader than math/code but not universal.

Why it matters: it reorients reliability engineering for agents away from answer-grading toward continuous in-process auditing.
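The gap between outcome and process verification can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than interwhen's actual interface: the `Step` type, the tool names, the single policy rule, and the `verify_outcome` and `verify_process` helpers are all hypothetical. The process check runs step by step, so in a live agent it could stop the trace at the first violation.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    tool: str                      # hypothetical tool name
    args: dict = field(default_factory=dict)


# A rule sees the steps taken so far plus the candidate step and
# returns a violation message, or None if the step is compliant.
Rule = Callable[[list, Step], Optional[str]]


def require_identity_check(history, step):
    """Hypothetical policy: verify identity before changing a plan."""
    if step.tool == "change_plan" and not any(s.tool == "verify_identity" for s in history):
        return "change_plan called before verify_identity"
    return None


def verify_outcome(final_answer, expected):
    """Outcome-level check: looks only at the final answer."""
    return final_answer == expected


def verify_process(trace, rules):
    """Process-level check: return (step_index, message) for the first
    policy violation, or None if every step complies."""
    history = []
    for i, step in enumerate(trace):
        for rule in rules:
            message = rule(history, step)
            if message is not None:
                return i, message
        history.append(step)
    return None


rules = [require_identity_check]
bad_trace = [Step("change_plan", {"plan": "basic"}), Step("verify_identity")]
good_trace = [Step("verify_identity"), Step("change_plan", {"plan": "basic"})]

# Both traces end with the same, correct final answer.
print(verify_outcome("plan_changed", "plan_changed"))  # True
print(verify_process(bad_trace, rules))   # (0, 'change_plan called before verify_identity')
print(verify_process(good_trace, rules))  # None
```

Both traces pass the outcome check, and only the process check separates them. The sketch also shows the limit stated above: it works only because the policy can be written as an executable rule.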
