Why does the first generated token trigger collapse of task superposition?
This explores why a model that seems to hold several possible reasoning paths at once appears to lock into one the moment it emits its first discrete token — and what the corpus says about avoiding that early commitment.
The clearest answer in the corpus comes from Soft Thinking, which makes the mechanism legible by removing it. Standard decoding forces the model to sample one discrete token from a probability distribution, and only that token's embedding is fed to the next step, so every alternative the distribution was holding is discarded. Soft Thinking instead feeds forward a probability-weighted blend of concept embeddings, so the "superposition" of paths survives into the next step rather than being decided at the first one. It improves accuracy by up to 2.48 points while using 22.4% fewer tokens, which is itself evidence that the discrete commitment was discarding useful signal (see "Can we explore multiple reasoning paths without committing to one token?").
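To make the contrast concrete, here is a minimal sketch, not the Soft Thinking implementation. The three-token vocabulary, the two-dimensional embeddings, the logits, and the function names `discrete_step` and `soft_step` are all hypothetical, chosen only to show what each decoding style hands to the next step.

```python
import math
import random


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discrete_step(probs, embeddings, rng):
    """Standard decoding: sample one token id and keep only its embedding."""
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return list(embeddings[token_id])


def soft_step(probs, embeddings):
    """Soft alternative: keep the probability-weighted mix of all embeddings."""
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings)) for d in range(dim)]


# Hypothetical toy vocabulary: three tokens, two-dimensional embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
LOGITS = [2.0, 2.0, 0.0]  # assumed values: two routes are equally likely

probs = softmax(LOGITS)
hard = discrete_step(probs, EMBEDDINGS, random.Random(0))
soft = soft_step(probs, EMBEDDINGS)
```

With these toy numbers, `hard` is always exactly one row of `EMBEDDINGS`, while `soft` lies between the rows, so both leading routes still influence the next step.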
Why is the *first* token so consequential? Because each token strongly conditions what follows. The memorization work shows that local context (the immediately preceding tokens) accounts for up to two-thirds of chain-of-thought errors, meaning that once a prefix is laid down the model is heavily steered by it rather than by the abstract problem (see "Where do memorization errors arise in chain-of-thought reasoning?"). So the first generated token isn't just one choice among many equal ones; it becomes the conditioning context that biases every subsequent step toward the branch it implies. Collapse isn't a single dramatic event so much as the first link in a chain that the rest of the generation is then pulled along.
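The conditioning effect can be illustrated with a toy calculation. Everything here is an assumption made for illustration: the two candidate first tokens, the two possible answers, and the probabilities in `FIRST` and `ANSWER_GIVEN_FIRST` are invented, not taken from the cited work. The sketch compares uncertainty about the final answer before and after a first token is committed.

```python
import math

# Hypothetical toy model: P(first token) and P(final answer | first token).
FIRST = {"Let": 0.5, "Guess": 0.5}
ANSWER_GIVEN_FIRST = {
    "Let": {"42": 0.9, "7": 0.1},
    "Guess": {"42": 0.2, "7": 0.8},
}


def entropy_bits(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def marginal_answer(first, answer_given_first):
    """Answer distribution before any first token has been committed."""
    out = {}
    for token, p_token in first.items():
        for answer, p_answer in answer_given_first[token].items():
            out[answer] = out.get(answer, 0.0) + p_token * p_answer
    return out


before = marginal_answer(FIRST, ANSWER_GIVEN_FIRST)
h_before = entropy_bits(before)
h_after = {tok: entropy_bits(dist) for tok, dist in ANSWER_GIVEN_FIRST.items()}
```

In this toy setup, `h_before` is higher than either value in `h_after`: committing to either first token already narrows the answer distribution toward the branch that token implies.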
This also reframes what reasoning chains *are*. The CoT critique work argues that chain-of-thought is constrained imitation: pattern-matching the shape of reasoning, bounded by the training distribution, rather than genuine inference (see "Why does chain-of-thought reasoning fail in predictable ways?"). Under that view, the model never really held independent "tasks" in clean parallel; it held a distribution over plausible continuations, and committing to a token simply selects which learned pattern to imitate. The superposition is statistical, not symbolic, which is exactly why a single sample can foreclose it.
The corpus also suggests not all tokens collapse things equally. Some tokens are mutual-information peaks: words like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers, and suppressing them specifically harms reasoning (see "Do reflection tokens carry more information about correct answers?"). Relatedly, models internally rank tokens by functional importance, preserving symbolic-computation tokens while discarding grammar and filler (see "Which tokens in reasoning chains actually matter most?"). Together these imply that the *cost* of premature collapse depends on which token you're committing to: lock in a high-information transition token early and you've decided a lot; emit filler and you've decided little.
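As a rough illustration of what a "mutual-information peak" means, the sketch below computes mutual information from simple counts. This is a simplified stand-in: the corpus notes do not specify the estimator the cited work uses, and the observation lists `WAIT_OBS` and `FILLER_OBS` are fabricated toy data, not measurements.

```python
import math
from collections import Counter


def mutual_information_bits(pairs):
    """Mutual information in bits between two discrete variables,
    estimated from a list of (x, y) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# Hypothetical toy observations: (trace contains token, answer is correct).
WAIT_OBS = ([(True, True)] * 40 + [(True, False)] * 10
            + [(False, True)] * 10 + [(False, False)] * 40)
FILLER_OBS = ([(True, True)] * 25 + [(True, False)] * 25
              + [(False, True)] * 25 + [(False, False)] * 25)

mi_wait = mutual_information_bits(WAIT_OBS)
mi_filler = mutual_information_bits(FILLER_OBS)
```

In the toy data, the token whose presence tracks correctness carries a clearly positive `mi_wait`, while the token that appears independently of correctness yields an `mi_filler` of zero. That is the sense in which committing to one kind of token decides a lot and committing to the other decides little.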
The most direct fix is not smarter sampling but delaying the decision entirely. Soft Thinking keeps the distribution alive across steps and uses entropy-based early stopping, and decoupling approaches (planning before execution, asynchronous verification) similarly avoid betting everything on one committed trace (see "Can verifiers monitor reasoning without slowing generation down?"). The throughline: collapse is the price of discreteness, and several corners of the corpus are trying to pay it later, or not at all.
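One plausible shape for an entropy-based stopping rule is sketched below. The corpus note says only that stopping is entropy-based, so the `threshold` and `patience` parameters, their default values, and the function name `stop_step` are assumptions made for illustration rather than the method's actual settings.

```python
import math


def entropy_nats(probs):
    """Shannon entropy in nats of a list of probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def stop_step(distributions, threshold=0.1, patience=2):
    """Return the index of the step at which generation would stop,
    or None if the rule never fires.

    The rule fires once entropy has stayed below `threshold` for
    `patience` consecutive steps (both values are assumed, not sourced).
    """
    streak = 0
    for index, probs in enumerate(distributions):
        streak = streak + 1 if entropy_nats(probs) < threshold else 0
        if streak >= patience:
            return index
    return None


# Hypothetical per-step next-token distributions over a two-token vocabulary.
STEPS = [
    [0.5, 0.5],      # undecided
    [0.99, 0.01],    # confident
    [0.6, 0.4],      # undecided again, streak resets
    [0.99, 0.01],    # confident
    [0.995, 0.005],  # confident for a second consecutive step
    [0.5, 0.5],
]

stopped_at = stop_step(STEPS)
never_stops = stop_step([[0.5, 0.5]] * 4)
```

Requiring several consecutive low-entropy steps, rather than one, is a design choice in this sketch: it avoids stopping on a single confident step that is followed by renewed uncertainty.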
Sources (6 notes)
Training-free method replaces discrete token selection with probability-weighted concept embeddings, preserving superposition of reasoning paths. Improves accuracy up to 2.48 points while reducing tokens 22.4% via entropy-based early stopping.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing an equal number of random tokens does not, and representation recycling improves accuracy by 20%.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Decoupling verification from generation lets verifiers run alongside a single trace, forking to extract verifiable state and intervening only on violations. On correct runs the latency penalty is near-zero; interwhen matches or beats CoT across benchmarks at similar token budgets.