Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

Paper · arXiv 2605.29288 · Published May 28, 2026

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: the part of a trace that follows the point where the answer already appears sufficiently supported, yet adds further reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removals and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty–geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

Introduction. To address this gap, we take a diagnostic view of answer-correct long-CoT traces. We seek a trace-internal diagnostic explanation for why answer-correct traces may differ in training utility, rather than assuming that a long reasoning trace is uniformly useful once its final answer is correct. Our goal is not to claim that every long tail is harmful, nor to treat length as the central issue. Instead, we ask whether some traces enter a low-value post-conclusion continuation: the answer is already sufficiently supported, but subsequent reasoning remains locally costly while showing weak hidden-state progress. From the uncertainty perspective, we observe that some post-conclusion continuation remains locally costly or unstable, suggesting that the trace continues to explore after evaluator-based answer support has largely saturated. From the geometric perspective, this continued exploration shows weakened terminal-directional hidden-state progress. We refer to this hypothesized low-value phase as post-conclusion continuation before evaluating its downstream training effect.
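The two perspectives above can be made concrete with a small sketch. The excerpt does not give the paper's exact definitions, so both measures here are assumptions: local uncertainty is taken as the Shannon entropy of a next-token distribution, and terminal-directional progress as the projection of one hidden-state step onto the direction toward the trace's final hidden state. The function names `token_entropy` and `terminal_progress` are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution.

    Assumption: used here as a stand-in for "local uncertainty".
    """
    return -sum(p * math.log(p) for p in probs if p > 0)


def terminal_progress(h_prev, h_next, h_final):
    """Signed length of the step h_prev -> h_next along the direction
    from h_prev to the final hidden state h_final.

    Assumption: used here as a stand-in for "terminal-directional progress".
    """
    step = [b - a for a, b in zip(h_prev, h_next)]
    to_end = [f - a for a, f in zip(h_prev, h_final)]
    norm = math.sqrt(sum(x * x for x in to_end))
    if norm == 0:
        return 0.0  # already at the terminal state, no direction is defined
    return sum(s * t for s, t in zip(step, to_end)) / norm


# A step that is uncertain (high entropy) but moves sideways relative to
# the terminal direction illustrates the kind of mismatch described above.
entropy = token_entropy([0.25, 0.25, 0.25, 0.25])
progress = terminal_progress([0.0, 0.0], [0.0, 1.0], [2.0, 0.0])
```

Under these assumed definitions, a segment with sustained entropy and near-zero progress would be a candidate for low-value continuation; how the paper aggregates such signals is not specified in this excerpt.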

Discussion / Conclusion. Analysis of Random Cut. We further introduce a random cut baseline to rule out the possibility that the improvement mainly comes from shorter responses. To align it with HCC, random cut preserves the final answer, removes a sentence-complete suffix from the reasoning trace, and controls the removed length to match the average truncation length of HCC. As shown in Table 4, random cut is consistently inferior to HCC on MATH500, AMC23, and GSM8K, yielding an average score of only 29.0 compared with 49.3 for HCC. This large gap suggests that arbitrary length reduction is not a reliable solution.
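A minimal sketch of a random cut baseline with the three stated properties (final answer preserved, sentence-complete suffix removed, removed length controlled) might look as follows. The details are assumptions rather than the paper's procedure: the sentence splitter is a naive regex, length is measured in characters, and the removed length is drawn uniformly from `[0, 2 * mean_removed_chars]` so that its mean matches the target before snapping to a sentence boundary. The names `split_sentences`, `random_cut`, and `mean_removed_chars` are hypothetical.

```python
import random
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def random_cut(reasoning, final_answer, mean_removed_chars, rng):
    """Remove a sentence-complete suffix of random length from `reasoning`.

    The final answer is returned untouched. The number of characters to
    remove is drawn with mean `mean_removed_chars` (assumed to be the
    average truncation length of the method being compared against),
    then snapped to the closest sentence boundary.
    """
    sentences = split_sentences(reasoning)
    if not sentences:
        return reasoning, final_answer

    # suffix_len[k] = characters removed if only the first k sentences are kept.
    # At least one sentence is always kept; k == len(sentences) removes nothing.
    suffix_len = {
        k: sum(len(s) for s in sentences[k:])
        for k in range(1, len(sentences) + 1)
    }
    budget = rng.uniform(0, 2 * mean_removed_chars)
    keep = min(suffix_len, key=lambda k: abs(suffix_len[k] - budget))
    return " ".join(sentences[:keep]), final_answer


trace = (
    "Let x = 2. Then x + 3 = 5. So the sum is 5. "
    "Wait, let me recheck. Yes, 2 + 3 = 5."
)
kept, answer = random_cut(trace, "5", 20, random.Random(0))
```

Because the cut point ignores where the answer becomes supported, this baseline isolates the effect of shortening alone, which is the comparison the paragraph above draws against HCC.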

