SYNTHESIS NOTE

Does every correct chain-of-thought trace improve fine-tuning?

Are all answer-correct reasoning traces equally valuable for training? This note explores whether some correct traces contain reasoning that harms model learning despite reaching the right answer.

Synthesis note · 2026-06-03 · sourced from Reasoning Critiques

The standard assumption behind distilling long chain-of-thought traces into a smaller model via SFT is that a trace is useful supervision once its final answer is correct. "Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces" (2605.29288) challenges that assumption. It identifies post-conclusion continuation: a segment in which the answer is already sufficiently supported but the trace keeps reasoning. That tail preserves the correct answer, yet it is harmful to train on. A delete-only editor that excises the post-conclusion suffix while keeping the answer yields measurably better SFT results than training on the full trace. The authors call the empirically confirmed phenomenon harmful continuation and ship a lightweight boundary proxy, Harmful Continuation Cut (HCC), that approximates where useful reasoning ends.
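As a minimal sketch of what a delete-only edit means in practice, the Python below truncates a trace at a given boundary and leaves the answer untouched. The `Trace` container, the `delete_only_edit` function, and the step-level segmentation are hypothetical names introduced for illustration, not the paper's implementation, and the boundary is taken as given rather than computed by HCC.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical container: ordered reasoning steps plus the final answer."""
    steps: List[str]
    answer: str


def delete_only_edit(trace: Trace, boundary: int) -> Trace:
    """Keep steps[:boundary], drop every later step, and preserve the answer.

    Nothing is rewritten or reordered: the only operation is deletion of
    the suffix that starts at `boundary`.
    """
    if not 0 <= boundary <= len(trace.steps):
        raise ValueError("boundary must lie within the trace")
    return Trace(steps=list(trace.steps[:boundary]), answer=trace.answer)


# Toy example (invented data): the answer is supported after two steps,
# and the remaining steps keep exploring without changing it.
example = Trace(
    steps=[
        "2 + 2 means adding two and two.",
        "Two plus two is four.",
        "Wait, let me double-check by counting.",
        "Alternatively, 2 * 2 is also four, so that agrees.",
    ],
    answer="4",
)
edited = delete_only_edit(example, boundary=2)
print(edited.steps)
print(edited.answer)
```

The edited trace is a strict prefix of the original plus the unchanged answer, which is what keeps the comparison against full-trace training about the suffix alone.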

The diagnostic move is what makes this distinct. The harmful tail is characterized by an uncertainty–geometry mismatch: persistent local uncertainty (the model keeps exploring as if unsettled) combined with weakened terminal-directional hidden-state progress (the exploration no longer moves the representation toward the answer). That mismatch, not length itself, is the signature. A random-cut baseline that removes a length-matched suffix without identifying where reasoning concluded performs far worse (avg 29.0 vs 49.3 for HCC across MATH500/AMC23/GSM8K, a 20.3-point gap), which indicates that the gain comes from cutting the right segment, not from shorter outputs.
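One way to picture the mismatch is the sketch below, which is an illustration under stated assumptions and not the paper's HCC definition. It assumes per-step entropy as the uncertainty signal and, as the progress signal, each step's hidden-state displacement projected onto the direction from the first to the last hidden state. The function names, the thresholds `min_entropy` and `max_progress`, and the rule "earliest step from which every later step shows the mismatch" are all hypothetical choices.

```python
import math
from typing import List, Sequence


def directional_progress(hidden: Sequence[Sequence[float]]) -> List[float]:
    """Project each step's displacement onto the unit start-to-end direction."""
    if len(hidden) < 2:
        raise ValueError("need at least two hidden states")
    direction = [e - s for s, e in zip(hidden[0], hidden[-1])]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return [0.0] * (len(hidden) - 1)
    unit = [d / norm for d in direction]
    return [
        sum((b - a) * u for a, b, u in zip(hidden[t], hidden[t + 1], unit))
        for t in range(len(hidden) - 1)
    ]


def mismatch_boundary(
    entropy: Sequence[float],
    hidden: Sequence[Sequence[float]],
    min_entropy: float = 0.5,   # assumed threshold for "still uncertain"
    max_progress: float = 0.1,  # assumed threshold for "no longer progressing"
) -> int:
    """Return the earliest step index from which every step is both
    uncertain and non-progressing; return the step count if no such tail."""
    progress = directional_progress(hidden)
    if len(entropy) != len(progress):
        raise ValueError("expected one entropy value per step")
    boundary = len(progress)
    # Walk backward from the end and extend the tail while the mismatch holds.
    for t in range(len(progress) - 1, -1, -1):
        if entropy[t] >= min_entropy and progress[t] <= max_progress:
            boundary = t
        else:
            break
    return boundary


# Toy data (invented): two steps move toward the end state, two do not,
# while entropy stays high throughout.
hidden_states = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 0.05], [2.0, 0.0]]
step_entropy = [0.9, 0.8, 0.7, 0.9]
cut = mismatch_boundary(step_entropy, hidden_states)
print(cut)
```

Note that the rule depends on both signals together: a tail with low entropy, or one that still moves toward the end state, is left uncut regardless of how long it is.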

This sits beside but does not duplicate the vault's existing trace-quality findings. It is not the faithfulness decay of "Does fine-tuning disconnect reasoning steps from final answers?", nor the benchmark-vs-quality divergence of "Does supervised fine-tuning improve reasoning or just answers?". Both of those describe what fine-tuning does to a model, whereas harmful continuation is a property of the training data itself. It sharpens the correlation in "Why do correct reasoning traces contain fewer tokens?": shorter-is-better holds, but the causal lever is removing post-conclusion exploration, not length per se. And it gives a data-curation counterpart to "Can reasoning steps be dynamically pruned without losing accuracy?": redundancy that is steerable at inference is also deletable at training time.

Relevant Notes

Why do correct reasoning traces contain fewer tokens? — sharpens the correlation: the causal lever is cutting post-conclusion exploration, not length
Does supervised fine-tuning improve reasoning or just answers? — complementary failure mode: this is data-side, the trap is model-side
Does fine-tuning disconnect reasoning steps from final answers? — another way answer-correct traces mislead SFT
Can reasoning steps be dynamically pruned without losing accuracy? — redundancy steerable at inference is deletable at training time

Inquiring lines that read this note 14

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Why do correct reasoning traces tend to be shorter than incorrect ones?

Do corrupted reasoning traces serve as effective supervision signals?

Why does supervised fine-tuning improve accuracy while degrading reasoning quality?

How does supervised fine-tuning degrade chain-of-thought faithfulness over time?

Do reasoning traces faithfully represent or merely mimic actual model reasoning?

Why do reasoning traces fail to accurately reflect model decision-making?

What actually drives chain-of-thought reasoning improvements in language models?

How much of chain-of-thought reasoning actually diverges from the final answer?

Can model confidence signals reliably improve reasoning quality and calibration?

Why does convergence stability sometimes mislead about reasoning correctness?

What properties determine whether reward signals teach genuine reasoning?

Do reasoning traces actually make better reward models for grading answers?

How do training data properties shape reasoning capability development?

Can reasoning improvements be attributed when optimizer and scaffold are unknown?

Does reinforcement learning teach reasoning or just when to reason?

Why does standard RL cause traces to collapse into redundant reasoning paths?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How does mining intermediate reasoning points compare to aggregating separate traces?

