Does the answer stage perform substantial reasoning beyond the thinking draft?
This explores whether the final 'answer' stage of a reasoning model actually computes anything new, or whether the real work has already happened (or never happened) in the thinking draft that precedes it.
The corpus suggests the honest answer is messier than the two-stage 'think, then answer' picture implies. The cleanest evidence comes from work showing that drafts and answers are only loosely coupled: counterfactual interventions reveal that models frequently contradict their own draft conclusions when producing the final answer, and are only selectively faithful even within the draft itself (Do language model reasoning drafts faithfully represent their actual computation?). So the answer stage isn't a faithful continuation of the draft, but that cuts both ways. The answer sometimes *diverges* from the draft rather than simply executing it, which is the opposite of 'no work happens at the end.'
The more deflating finding is that on easy problems, the model has often already committed to an answer before the reasoning is finished. Activation probes catch models locking in their answer early on simple tasks, where the visible chain-of-thought is performative; on hard tasks the same probes track genuine belief updates with real inflection points (Does chain-of-thought reasoning reflect genuine thinking or performance?). That difficulty-dependence is the key: 'does the answer stage do real reasoning' has no single answer because it depends on whether the problem needed reasoning at all. This lines up with evidence that the intermediate tokens carry no special execution semantics: invalid traces frequently produce correct answers, so the trace correlates with the answer through learned formatting rather than driving it (Do reasoning traces actually cause correct answers?), and logically broken chains perform nearly as well as valid ones (Does logical validity actually drive chain-of-thought gains?).
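To make the "locks in early" idea concrete, here is a minimal sketch of how a commitment point could be read off per-step probe outputs. Everything in it is an assumption for illustration: the function name `commitment_step`, the 0.9 confidence threshold, and the toy probe readings are hypothetical and are not taken from the cited work.

```python
from typing import Optional, Sequence


def commitment_step(
    probe_answers: Sequence[str],
    probe_confidence: Sequence[float],
    threshold: float = 0.9,  # hypothetical cutoff, not from the source
) -> Optional[int]:
    """Return the earliest step from which the probe's predicted answer
    equals its final prediction, with confidence >= threshold, at every
    remaining step. Returns None if even the last step is not confident."""
    if not probe_answers or len(probe_answers) != len(probe_confidence):
        raise ValueError("need two non-empty sequences of equal length")

    final_answer = probe_answers[-1]
    commit = None
    # Walk backwards from the end while the probe stays locked in.
    for i in range(len(probe_answers) - 1, -1, -1):
        if probe_answers[i] == final_answer and probe_confidence[i] >= threshold:
            commit = i
        else:
            break
    return commit


# Toy readings (invented): an easy problem where the probe is settled from
# the first step, and a hard one where the belief changes twice.
easy_answers = ["42"] * 6
easy_conf = [0.95, 0.96, 0.97, 0.97, 0.98, 0.98]
hard_answers = ["7", "7", "12", "12", "9", "9"]
hard_conf = [0.40, 0.50, 0.60, 0.70, 0.92, 0.95]

print("easy commit step:", commitment_step(easy_answers, easy_conf))
print("hard commit step:", commitment_step(hard_answers, hard_conf))
```

In this toy setup, a commitment step near the start of the trace corresponds to the performative case, and a late one corresponds to belief updates continuing through the draft.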
If the draft tokens aren't where the reasoning lives, where is it? One strand argues that much of the computation happens in latent space that never gets verbalized at all: depth-recurrent models, Coconut, and Heima scale test-time compute through hidden-state iteration with no visible intermediate steps, suggesting verbalization is a training artifact rather than the reasoning itself (Can models reason without generating visible thinking tokens?). Under that view, asking whether the *answer* stage reasons 'beyond' the *draft* is the wrong frame, because the real work isn't cleanly partitioned into either visible stage.
There's also a quantity trap worth knowing about. More draft does not mean more reasoning: extended thinking improves accuracy mainly by inflating output variance, widening the distribution so it covers the right answer more often, not by reasoning better. Past a threshold the distribution gets too diffuse and accuracy drops (Does extended thinking actually improve reasoning or just increase variance?). Accuracy follows an inverted-U in trace length, and the optimal length shrinks as models get more capable, so stronger models do best with *shorter* chains (Why does chain of thought accuracy eventually decline with length?; Does more thinking time always improve reasoning accuracy?). So a long, elaborate draft followed by a terse answer can mean the model is sampling-searching, not building toward a final computation.
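The coverage mechanism can be illustrated with a toy model. This is a sketch under stated assumptions, not the cited analysis: outputs are treated as draws from a normal distribution whose mean is offset from the correct answer (`bias`), and a draw counts as correct if it lands within `tolerance` of the truth. The names and numbers are all invented for illustration.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def hit_probability(bias: float, spread: float, tolerance: float) -> float:
    """P(|sample| <= tolerance) for sample ~ Normal(bias, spread**2),
    with the correct answer placed at 0."""
    upper = (tolerance - bias) / spread
    lower = (-tolerance - bias) / spread
    return normal_cdf(upper) - normal_cdf(lower)


# Hold the bias fixed and widen the distribution: coverage of the correct
# answer first rises, then falls as probability mass spreads too thin.
for spread in (0.2, 0.5, 1.0, 2.0, 5.0, 10.0):
    p = hit_probability(bias=1.0, spread=spread, tolerance=0.25)
    print(f"spread={spread:>5}: P(hit)={p:.4f}")
```

Note that the bias never changes in this sketch. Only the spread does, so any gain in hit rate comes from coverage rather than from the distribution moving toward the right answer.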
The practical upshot: because the answer can silently contradict the draft, scoring only the final output misses where things actually break. Verifying the *process* (intermediate states and policy compliance as the trace unfolds) raised task success from 32% to 87%, because most failures are process violations that a correct-looking answer hides (Where do reasoning agents actually fail during long traces?). The takeaway isn't 'the answer stage does or doesn't reason.' It's that the draft and the answer are separable, sometimes-contradictory artifacts, and treating either one as a faithful window into the model's actual computation is the mistake.
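The difference between scoring the final output and verifying the process can be sketched in a few lines. This is a hypothetical illustration only: the `Step` type, the refund scenario, and the single policy rule are invented here and do not describe the verifier used in the cited work.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    action: str   # what the agent did at this step
    state: dict   # environment state after the action


# A rule returns True when the step complies with policy.
Rule = Callable[[Step], bool]


def outcome_only_check(trace: List[Step], expected_final: dict) -> bool:
    """Score only the end state, ignoring how the agent got there."""
    return bool(trace) and trace[-1].state == expected_final


def process_check(trace: List[Step], rules: Dict[str, Rule]) -> List[Tuple[int, str]]:
    """Check every intermediate step against every rule and return
    (step_index, rule_name) for each violation."""
    violations = []
    for i, step in enumerate(trace):
        for name, rule in rules.items():
            if not rule(step):
                violations.append((i, name))
    return violations


# Hypothetical policy: a refund may only be issued once identity is verified.
rules = {
    "refund_requires_verified_identity": (
        lambda s: s.action != "issue_refund" or s.state["verified"]
    ),
}

# The agent refunds first and verifies afterwards, so the end state looks fine.
trace = [
    Step("lookup_order", {"verified": False, "refunded": False}),
    Step("issue_refund", {"verified": False, "refunded": True}),
    Step("verify_identity", {"verified": True, "refunded": True}),
]
expected_final = {"verified": True, "refunded": True}

print("outcome-only pass:", outcome_only_check(trace, expected_final))
print("process violations:", process_check(trace, rules))
```

Here the outcome-only check accepts the trace because the end state matches, while the step-level check flags the refund issued before verification.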
Sources (9 notes)
Counterfactual interventions show LRMs exhibit selective faithfulness within drafts and frequent contradictions between draft conclusions and final answers, undermining the safety promise of reasoning transparency.
Activation probes show models commit to answers internally long before finishing their reasoning on easy tasks, but on hard tasks the reasoning process tracks real belief updates with detectable inflection points. Probe-guided early exit reduces tokens by up to 80 percent without accuracy loss.
R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, indicating that traces are not causally necessary: they correlate with answers via learned formatting, not functional reasoning.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.
Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.
Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.