SYNTHESIS NOTE

Do chain-of-thought traces actually help users understand model reasoning?

Chain-of-thought explanations are often presented as transparency tools, but do they genuinely improve human understanding or create an illusion of interpretability? A human-subject study tests whether traces help users follow and evaluate model reasoning.

Synthesis note · 2026-02-22 · sourced from Reasoning Critiques

A common assumption behind CoT traces is that they serve as explanations: the model shows its work, users follow the reasoning, and trust is established. This assumption turns out to be wrong in a specific and quantifiable way.

Empirical findings from a 100-participant human-subject study:

R1 traces: highest final solution accuracy, lowest human interpretability ratings
Algorithmically generated, semantically correct traces: lowest performance despite being verifiably correct
LLM-generated summaries of R1 traces: better interpretability, intermediate performance

The traces that are most useful for the model to generate correct answers are least useful for humans trying to understand those answers. The two objectives pull in opposite directions.

The mechanism: CoT traces used for SFT are optimized to be a training signal — to push the model toward correct token sequences through backpropagation. The properties that make a trace useful for training (complex recursive structure, non-linear exploration, self-doubt and revision cycles) are exactly the properties that make it cognitively opaque to humans.

This has a design implication that some systems are already acting on: GPT-OSS models generate a CoT trace (for model performance), a summary (for human communication), and a final answer. The trace is not shown to users. This separation acknowledges the decoupling.
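The separation described above can be sketched as a small data structure. This is a minimal illustration only: `ModelResponse`, its field names, and `user_view` are hypothetical and do not correspond to any GPT-OSS API. The point is that the trace is stored but never included in what the user sees.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResponse:
    """Hypothetical container for one model turn (all names are illustrative)."""

    trace: str    # raw chain-of-thought, kept for the model and for debugging
    summary: str  # short explanation written for the human reader
    answer: str   # final answer

    def user_view(self) -> dict[str, str]:
        """Return only the fields intended for display; the trace is withheld."""
        return {"summary": self.summary, "answer": self.answer}


# Toy content, invented for this example.
response = ModelResponse(
    trace="Try x=3... no, that fails the second constraint. Revisit: x=4 ...",
    summary="x=4 is the only value that satisfies both constraints.",
    answer="4",
)

shown = response.user_view()
print(shown)
```

Keeping the trace on the object but out of `user_view` mirrors the design choice in the text: the trace stays available for the model's own performance or for auditing, while the user-facing artifact is the summary.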

The implication for AI transparency: showing users CoT traces is not showing them how the model reasons. It is showing them the model's training scaffold. What users need is a summary; what models need is the trace. Conflating the two in the name of "explainability" produces outputs that feel transparent without providing genuine interpretability.

This is a distinct claim from Do reasoning traces actually cause correct answers? That note warns against inferring intentional reasoning from traces; this note adds that even without anthropomorphizing, the traces are the wrong artifact for human interpretability. Both uses of traces fail, in different ways.

Controlled user-study evidence: traces don't just fail to help — they actively mislead. The interpretability-rating gap documented above measures how understandable traces feel; a between-subject user study ("Evaluating the False Trust Engendered by LLM Explanations") measures whether they improve judgment, and finds the stronger result. Showing users reasoning traces or post-hoc explanations raises their acceptance of the model's answer regardless of whether the answer is correct — the explanations are persuasive but not informative. This sharpens the decoupling claim from "traces serve the model not the user" to "traces given to the user degrade their ability to detect errors." The only explanation format that restored discrimination in that study was a contrastive dual explanation arguing both sides (see Do explanations actually help users spot AI mistakes?) — i.e., the fix is not a better one-sided trace but an artifact that argues against the model's own output.

Source (enrichment): Flaws — "Evaluating the False Trust Engendered by LLM Explanations", https://arxiv.org/abs/2605.10930
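The difference between "raises acceptance" and "improves judgment" in the user-study paragraph above can be made concrete with a small calculation. The sketch below assumes a hypothetical trial schema (`Trial`, `condition`, `model_correct`, `accepted`) and uses invented toy data; none of the numbers come from the cited study. Discrimination is taken here as the acceptance rate for correct answers minus the acceptance rate for incorrect answers, which is one simple choice and not necessarily the measure the study used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One participant judgment (hypothetical schema, not the study's data format)."""

    condition: str       # e.g. "no_explanation" or "trace"
    model_correct: bool  # was the model's answer actually correct?
    accepted: bool       # did the participant accept the model's answer?


def acceptance_rate(trials, condition, model_correct):
    """Share of trials in a condition where the answer was accepted, given its correctness."""
    subset = [
        t for t in trials
        if t.condition == condition and t.model_correct == model_correct
    ]
    if not subset:
        raise ValueError(f"no trials for {condition!r} with model_correct={model_correct}")
    return sum(t.accepted for t in subset) / len(subset)


def discrimination(trials, condition):
    """Acceptance of correct answers minus acceptance of incorrect answers.

    A value near 0 means users accept right and wrong answers alike.
    """
    return acceptance_rate(trials, condition, True) - acceptance_rate(trials, condition, False)


def make_trials(condition, correct_accepts, incorrect_accepts):
    """Build toy trials from lists of accept/reject decisions."""
    return (
        [Trial(condition, True, a) for a in correct_accepts]
        + [Trial(condition, False, a) for a in incorrect_accepts]
    )


# Invented toy data, chosen only to illustrate the pattern; NOT the study's results.
trials = (
    make_trials("no_explanation", [True, True, True, False], [True, False, False, False])
    + make_trials("trace", [True, True, True, True], [True, True, True, False])
)

for cond in ("no_explanation", "trace"):
    print(cond, acceptance_rate(trials, cond, True),
          acceptance_rate(trials, cond, False), discrimination(trials, cond))
```

In this toy data the trace condition has higher acceptance for both correct and incorrect answers, yet lower discrimination. That is the shape of result the paragraph describes: an explanation can be persuasive without being informative.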

Inquiring lines that read this note 4

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why do correct reasoning traces tend to be shorter than incorrect ones?

Do longer chain-of-thought traces improve interpretability or just performance?

What actually drives chain-of-thought reasoning improvements in language models?

Can chain-of-thought traces harm rather than help user understanding?

Related concepts in this collection 7

Do reasoning traces actually cause correct answers? Explores whether the intermediate 'thinking' tokens in R1-style models genuinely drive reasoning or merely mimic its appearance. Matters because false confidence in invalid traces could mask errors.
traces are not verified reasoning AND are not human-interpretable; two separate failures
Do language models actually use their reasoning steps? Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
causal faithfulness and user interpretability are both absent; neither is guaranteed by the presence of a trace
Why do models trust their own generated answers? Can language models reliably detect their own errors through self-evaluation? This explores whether the same process that generates answers can objectively assess their correctness.
models can't evaluate their own reasoning; neither can users from raw traces
Does chain-of-thought reasoning reveal genuine inference or pattern matching? Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
explains why the decoupling exists: if CoT is constrained imitation of reasoning patterns from training data, traces are optimized to continue familiar token sequences (model performance) not to explain the reasoning process to humans (interpretability)
Does fine-tuning disconnect reasoning steps from final answers? When models are fine-tuned on specific domains, do their chain-of-thought steps become less causally connected to their outputs? Three experiments test whether reasoning chains remain functionally faithful after training.
fine-tuning exacerbates both the faithfulness and interpretability dimensions: if trace interpretability is already decoupled from model performance (this note), and fine-tuning further decouples reasoning steps from final answers (faithfulness degradation), then post-fine-tuning traces serve neither the model nor the user
Does supervised fine-tuning improve reasoning or just answers? Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.
the SFT accuracy trap creates the conditions for the performance-interpretability decoupling: accuracy optimization selects for traces that drive correct outputs rather than traces that explain reasoning, directly producing the divergence documented here
Can LLM explanations actually help humans predict model behavior? Do model explanations enable users to accurately simulate how the model will behave on related inputs? This matters because it determines whether explanations genuinely improve human understanding or just create an illusion of understanding.
provides the metric-level evidence for this architectural decoupling: explanation precision (can users predict model behavior from explanations?) is uncorrelated with plausibility (do explanations look good?), confirming that RLHF-style optimization improves appearance without improving functional utility

Original note title

cot traces optimize model performance, not user interpretability — the two objectives are decoupled
