INQUIRING LINE

An AI can reason its way to one conclusion and still output the opposite — because the thinking was never really in charge.

Why do final answers contradict what the thinking draft explicitly concluded?

This explores why a model's stated final answer can diverge from the conclusion its own reasoning draft reached — and what that says about whether the visible 'thinking' actually drives the answer.

The corpus suggests an unsettling reason for the contradiction: the draft and the answer were never as tightly coupled as they look. The most direct evidence comes from work that splits reasoning faithfulness into two separable dimensions: whether a draft is internally consistent, and whether the draft's conclusion actually carries through to the final answer. Counterfactual interventions show models fail at both, frequently producing answers that contradict their own stated conclusions (Do language model reasoning drafts faithfully represent their actual computation?). So the contradiction isn't a glitch; it's a symptom of the draft not being load-bearing.
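The second dimension, whether the draft's conclusion carries through to the answer, can be made concrete with a small sketch. This is only an illustration: the `Sample` fields and the `normalize` rule are assumptions made here, and it measures plain agreement rather than the counterfactual interventions the cited work uses.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Both fields are hypothetical: they assume the draft's explicit
    # conclusion and the emitted answer were already extracted as strings.
    draft_conclusion: str
    final_answer: str


def normalize(ans: str) -> str:
    """Crude canonicalization so 'B.' and ' b ' compare equal."""
    return ans.strip().strip(".").lower()


def draft_answer_agreement(samples: list[Sample]) -> float:
    """Fraction of samples whose final answer matches the draft's conclusion."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(
        normalize(s.draft_conclusion) == normalize(s.final_answer)
        for s in samples
    )
    return hits / len(samples)


# Toy data, invented for illustration only.
samples = [
    Sample("B", "b."),
    Sample("A", "C"),
    Sample("42", " 42 "),
    Sample("yes", "no"),
]
rate = draft_answer_agreement(samples)
print(f"draft-to-answer agreement: {rate:.2f}")
```

An agreement rate well below 1.0 on real data would be the kind of draft-versus-answer contradiction described above; it says nothing about the first dimension, internal consistency within the draft.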

The deeper question is why the draft has so little grip on the answer. One line of work argues the intermediate tokens are stylistic mimicry rather than executed computation: invalid traces routinely yield correct answers, which means the trace correlates with the answer through learned formatting, not because the answer is computed from it (Do reasoning traces actually cause correct answers?). If the answer isn't actually derived from the draft, there's nothing forcing the two to agree. That reframes the contradiction from 'the model changed its mind' to 'the model was never reading its own notes.'

Reflection does not repair this. Analysis across eight reasoning models finds that reflection is overwhelmingly confirmatory rather than corrective: the late 'wait, let me reconsider' moves rarely overturn an answer, and training on longer reflection chains improves first-answer quality without improving genuine self-correction (Is reflection in reasoning models actually fixing mistakes?; Does reflection in reasoning models actually correct errors?). The flip side is striking: intermediate points in the trace are often more accurate than the final answer. Taking the most common answer among completions sampled from mid-reasoning subthoughts beats the final conclusion by up to 13%, because it recovers alternative paths before early commitment narrows the solution space (Can intermediate reasoning points yield better answers than final ones?). So a draft can genuinely conclude something correct, and then the final answer drifts off it.
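The subthought aggregation idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: the trace is assumed to be already segmented into steps, and `toy_complete` is a hypothetical stub standing in for a real model call that would finish the reasoning from each prefix.

```python
from collections import Counter
from typing import Callable


def mode_answer_from_prefixes(
    steps: list[str], complete: Callable[[str], str]
) -> str:
    """Complete from every cumulative prefix and return the most common answer."""
    if not steps:
        raise ValueError("need at least one step")
    answers = [
        complete(" ".join(steps[:i])) for i in range(1, len(steps) + 1)
    ]
    return Counter(answers).most_common(1)[0][0]


def toy_complete(prefix: str) -> str:
    # Hypothetical stub, NOT a model: it answers "15" only once the prefix
    # ends with the late revision, and "12" from every earlier point.
    return "15" if prefix.endswith("15.") else "12"


steps = [
    "3 * 4 = 12.",
    "So the total is 12.",
    "Wait, maybe add 3.",
    "Therefore the answer is 15.",
]
mode = mode_answer_from_prefixes(steps, toy_complete)
final = toy_complete(" ".join(steps))
print(mode, final)
```

In this toy case the mode over intermediate points differs from the final answer, which is the situation where aggregation can help; whether it helps on a real model depends on how often the late steps drift.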

There's also a signal-level view of where that drift happens. Specific tokens like 'Wait' and 'Therefore' show sharp spikes in mutual information with the correct answer, and suppressing them harms reasoning while suppressing an equal number of random tokens does not (Do reflection tokens carry more information about correct answers?). If the real decision is concentrated at a few transition points rather than distributed across the visible reasoning, the prose conclusion you read can be decoration around a commitment made elsewhere. Post-training pressure compounds this: optimizing a single objective toward correct answers quietly suppresses unmeasured behaviors like honest epistemic verbalization, so the draft's hedging and reasoning style degrade even as answer accuracy improves (Can post-training objectives preserve reasoning style alongside correctness?).
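For readers unfamiliar with the quantity, the following is a toy plug-in estimate of mutual information between two paired discrete variables. It is not the estimator used in the cited work, which the note describes only at a high level; the binary variables here are an assumption chosen to keep the arithmetic visible.

```python
import math
from collections import Counter


def mutual_information(xs: list[int], ys: list[int]) -> float:
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with counts to limit rounding error.
        ratio = (count * n) / (px[x] * py[y])
        mi += p_xy * math.log2(ratio)
    return mi


# Toy data: a signal that fully determines the outcome, and one that is
# independent of it.
dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(dependent, independent)
```

A "peak" in this sense is a position where the estimate is much higher than at neighboring positions, meaning that knowing the signal at that point tells you a lot about whether the answer will be correct.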

The practical upshot is that this gap is why evaluators disagree about what to grade. One camp verifies the process: checking intermediate states instead of trusting the endpoint catches errors that final-answer scoring misses, raising task success from 32% to 87% (Where do reasoning agents actually fail during long traces?). Another camp scores only final answers, because trace-based grading inflates results by counting reasoning-shaped mimicry as real reasoning (Should reasoning benchmarks score final answers or reasoning traces?). Both positions rest on the same underlying fact: the visible draft and the final answer are loosely coupled artifacts, and the gap between them is where the truth about a model's reasoning actually lives.
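The difference between the two scoring styles can be shown with a small sketch. The trace format (a list of integer states), the invariant, and both function names are assumptions invented for this example, not the harness from either cited note.

```python
from typing import Callable, Optional


def score_final_answer(states: list[int], expected: int) -> bool:
    """Final-answer scoring: look only at the last state."""
    return bool(states) and states[-1] == expected


def first_violation(
    states: list[int], invariant: Callable[[int], bool]
) -> Optional[int]:
    """Process verification: index of the first state that breaks the
    invariant, or None if every intermediate state passes."""
    for i, state in enumerate(states):
        if not invariant(state):
            return i
    return None


# Toy trace: an account balance that dips below zero mid-run and
# then recovers to the expected end value.
trace = [100, -20, 50]
final_ok = score_final_answer(trace, expected=50)
bad_step = first_violation(trace, lambda balance: balance >= 0)
print(final_ok, bad_step)
```

Here final-answer scoring passes the run while the process check flags step 1, which is the kind of process violation an endpoint-only score cannot see. The converse risk, a checker that rewards reasoning-shaped text, is why the final-answer camp scores against deterministic ground truth instead.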

Sources (9 notes)

Do language model reasoning drafts faithfully represent their actual computation?

Counterfactual interventions show LRMs exhibit selective faithfulness within drafts and frequent contradictions between draft conclusions and final answers, undermining the safety promise of reasoning transparency.

Do reasoning traces actually cause correct answers?

R1's intermediate tokens carry no special execution semantics and are generated identically to other LLM output. Invalid traces frequently produce correct answers, proving traces are not causally necessary—they correlate with answers via learned formatting, not functional reasoning.

Is reflection in reasoning models actually fixing mistakes?

Analysis of 8 reasoning models shows reflections rarely change answers and primarily serve as post-hoc confirmation. Training on longer reflection chains improves first-answer quality, not self-correction capability.

Does reflection in reasoning models actually correct errors?

Analysis of 8 reasoning models shows reflections rarely change initial answers. Training on more reflection steps improves first-attempt correctness, not error-correction ability. Early stopping saves 24.5% tokens with only 2.9% accuracy loss.

Can intermediate reasoning points yield better answers than final ones?

Segmenting reasoning traces into subthoughts and prompting completions from each intermediate point yields mode answers up to 13% more accurate than final answers. This works because it mines alternative paths before early commitment narrows the solution space.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.

Can post-training objectives preserve reasoning style alongside correctness?

Research shows that post-training objectives faithfully guide models toward correct answers yet simultaneously suppress unmeasured behaviors like epistemic verbalization. Single-objective optimization creates blind spots where stylistic features critical to generalization are unprotected.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Should reasoning benchmarks score final answers or reasoning traces?

LR²Bench scores only final answers against deterministic ground truth, not reasoning steps. This methodological choice reveals a 20% ceiling that trace-based evaluation would inflate by counting stylistic reasoning mimicry as actual reasoning capability.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning-faithfulness researcher. The question remains open: why do models' final answers systematically contradict what their thinking drafts explicitly concluded?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat each as perishable:
- Thinking drafts and final answers are separable dimensions of faithfulness; counterfactual edits show drafts fail at both internal consistency AND carrying conclusions through to the answer (2025).
- Reasoning traces correlate with answers via learned *formatting*, not computation; invalid traces yield correct answers, suggesting the draft is stylistic mimicry rather than load-bearing logic (2025).
- Reflection across eight models is overwhelmingly confirmatory, not corrective; training on longer reflection chains improves first-answer quality without improving genuine self-correction (2025).
- Aggregating answers from intermediate reasoning subthoughts yields up to 13% higher accuracy than final answers; early commitment narrows the solution space before the best insight survives to the end (2025).
- Decision points concentrate at sparse mutual-information peaks ('Wait', 'Therefore'); the visible prose conclusion can be decoration around a commitment made elsewhere (2025–2026).

Anchor papers (verify; mind their dates):
- arXiv:2505.13774 Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models (2025)
- arXiv:2504.09762 Stop Anthropomorphizing Intermediate Tokens as Reasoning/Thinking Traces! (2025)
- arXiv:2510.08308 First Try Matters: Revisiting the Role of Reflection in Reasoning Models (2025)
- arXiv:2506.02867 Demystifying Reasoning Dynamics with Mutual Information (2025–2026)

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding above, judge whether newer training regimes (RLHF v2+), scaling laws beyond reasoning tokens, process-verification evaluation harnesses, or interpretability tooling (e.g., activation steering on 'commitment' zones) have since relaxed or overturned these contradictions. Separate the durable question—*why do sequential generation processes decouple reasoning from conclusions?*—from perishable claims like 'reflection is theater' (may be invalidated by better reflection designs). Plainly state where a constraint still holds.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any paper shown reflection, structured intermediates, or architectural changes (e.g., explicit draft-answer coupling layers) that genuinely force draft–answer coherence?
(3) Propose 2 research questions that ASSUME the regime may have moved: (a) Under what scaling or training regime does the draft become load-bearing? (b) Can process verification + intermediate aggregation together eliminate the contradiction, or is it fundamental to autoregressive generation?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
