How does chain-of-thought length affect attention to constraint tokens?
This reads 'constraint tokens' as the load-bearing tokens in a reasoning chain (the actual computation, conditions, and pivot words that determine the answer) and asks whether making the chain longer changes how much the model weights them. The corpus doesn't measure raw attention weights, but its findings map closely onto how length dilutes or concentrates the signal-carrying tokens.
No note here measures attention weights on constraint tokens directly, but several converge on a sharper version of the question: as a chain grows, the tokens that matter get crowded by tokens that don't. One finding is that reasoning accuracy drops sharply with input length well before the context window is full: from 92% to 68% with just 3,000 tokens of padding, an effect that is task-agnostic and persists even with chain-of-thought prompting Does reasoning ability actually degrade with longer inputs?. That result concerns padded inputs rather than self-generated chains, but it suggests that length itself, not difficulty, degrades the model's grip.
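The padding result rests on holding the task fixed while varying only input length. A minimal sketch of that kind of setup follows; it is not the benchmark's actual construction. The function name `pad_input`, the word-count length measure (a stand-in for tokens), and the random placement of facts are all assumptions made for illustration.

```python
import random

def pad_input(facts, filler_sentences, target_words, seed=0):
    """Embed `facts` (kept in order) among filler until roughly `target_words` words.

    Hypothetical helper: length is counted in whitespace-separated words,
    which only approximates a real tokenizer.
    """
    rng = random.Random(seed)
    words = sum(len(f.split()) for f in facts)
    padding = []
    i = 0
    # Cycle through the filler sentences until the word budget is reached.
    while words < target_words and filler_sentences:
        sentence = filler_sentences[i % len(filler_sentences)]
        padding.append(sentence)
        words += len(sentence.split())
        i += 1
    # Sorted insertion slots keep the facts in their original order.
    slots = sorted(rng.randint(0, len(padding)) for _ in facts)
    pending = list(zip(slots, facts))
    out = []
    for pos in range(len(padding) + 1):
        while pending and pending[0][0] == pos:
            out.append(pending.pop(0)[1])
        if pos < len(padding):
            out.append(padding[pos])
    return " ".join(out)

facts = ["A is left of B.", "B is left of C."]
filler = ["The weather was mild that day.", "Nothing else of note happened."]
short_prompt = pad_input(facts, filler, target_words=0)
long_prompt = pad_input(facts, filler, target_words=300)
```

Both prompts contain the same two facts in the same order, so any accuracy gap between them would be attributable to length rather than to the reasoning required.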
What's striking is how few tokens actually carry the constraint. One study finds the signal lives in sparse spikes: words like 'Wait' and 'Therefore' show sharp peaks in mutual information with the correct answer, and suppressing those specific tokens harms reasoning while suppressing an equal number of random tokens does not Do reflection tokens carry more information about correct answers?. A parallel result, based on likelihood-preserving pruning, shows that symbolic-computation tokens are preferentially preserved while grammar and meta-discourse are pruned first Which tokens in reasoning chains actually matter most?. And 'Chain of Draft' shows you can match full chain-of-thought accuracy at 7.6% of the token cost, meaning roughly 92% of a verbose chain is style and documentation, not constraint Can minimal reasoning chains match full explanations?. So a longer chain need not add more constraint; it can simply dilute the constraint that's already there.
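The suppression comparison above is a matched-count ablation: remove the suspected signal tokens, then remove the same number of randomly chosen other tokens as a control. A minimal sketch of building the two variants is below. It deletes tokens from text, which is a simplification and may differ from how the study suppressed them; `PIVOT_WORDS` and `ablate` are hypothetical names, and the word list contains only the two examples mentioned in the text.

```python
import random

# Assumption: an illustrative pivot list, not the study's actual token set.
PIVOT_WORDS = {"wait", "therefore"}

def ablate(tokens, pivots=PIVOT_WORDS, seed=0):
    """Return (pivot_ablated, random_ablated), each with the same number of tokens removed."""
    def is_pivot(tok):
        return tok.strip(".,;:!?").lower() in pivots

    pivot_idx = [i for i, t in enumerate(tokens) if is_pivot(t)]
    other_idx = [i for i, t in enumerate(tokens) if not is_pivot(t)]
    # Remove equal counts from both groups so the comparison is fair.
    k = min(len(pivot_idx), len(other_idx))
    drop_pivots = set(pivot_idx[:k])
    drop_random = set(random.Random(seed).sample(other_idx, k))
    pivot_ablated = [t for i, t in enumerate(tokens) if i not in drop_pivots]
    random_ablated = [t for i, t in enumerate(tokens) if i not in drop_random]
    return pivot_ablated, random_ablated

chain = "Wait, the sum is 12. Therefore x equals 4.".split()
pivot_ablated, random_ablated = ablate(chain)
```

Each variant would then go through whatever accuracy evaluation is available. The finding described above corresponds to the pivot-ablated variant scoring worse than the random-ablated one.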
The dilution has a measurable failure signature. Token-level memorization errors are dominated by *local* memorization (predictions driven by the immediately preceding tokens), which accounts for up to 67% of reasoning errors and becomes more prominent as complexity increases and under distributional shift Where do memorization errors arise in chain-of-thought reasoning?. One reading is that, as the chain lengthens, the model's next token leans more on the recent local window and less on the original constraints stated far back. That would be consistent with the 'lost in the middle' feeling: the constraint tokens are still in context, but their influence decays relative to nearby surface patterns.
This is why length has an optimum rather than a monotone benefit. Accuracy follows an inverted U, peaking at intermediate length and falling off as chains get longer; the optimal length grows with task difficulty and shrinks as the model gets more capable Why does chain of thought accuracy eventually decline with length?. Pushed to extremes, raising thinking tokens from ~1,100 to ~16K dropped benchmark accuracy from 87.3% to 70.3% Does more thinking time always improve reasoning accuracy?. The intuition that 'more reasoning = more attention to the rules' is backwards: past a point, more reasoning means the rules compete with an ever-larger pile of self-generated filler.
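One way to check for an inverted U in your own runs is to sweep chain length and see whether the best accuracy falls strictly inside the tested range. The sketch below assumes a dict of length-to-accuracy measurements; `summarize_sweep` is a hypothetical helper. Only the two `reported` endpoints come from the text. The `placeholder_sweep` values are made up purely to exercise the function and are not results from any study.

```python
def summarize_sweep(acc_by_length):
    """acc_by_length maps chain length (tokens) to accuracy (%).

    Returns (best_length, interior_peak, drop_from_peak_to_longest).
    """
    lengths = sorted(acc_by_length)
    best = max(lengths, key=lambda n: acc_by_length[n])
    # An inverted U needs the peak strictly between the shortest and longest settings.
    interior_peak = lengths[0] < best < lengths[-1]
    drop = round(acc_by_length[best] - acc_by_length[lengths[-1]], 1)
    return best, interior_peak, drop

# The two endpoints quoted in the text (~1,100 and ~16K thinking tokens).
reported = {1100: 87.3, 16000: 70.3}
reported_summary = summarize_sweep(reported)

# Made-up placeholder values, for illustration only.
placeholder_sweep = {256: 71.0, 1024: 84.0, 4096: 79.0, 16384: 66.0}
placeholder_summary = summarize_sweep(placeholder_sweep)
```

Two endpoints can show a decline but cannot by themselves establish an interior peak, so a real sweep needs at least one shorter setting below the suspected optimum.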
The deeper reframe the corpus offers is that chain length may not reflect adaptive computation on constraints at all. In controlled maze experiments, trace length correlates with difficulty only in-distribution and decouples entirely out-of-distribution; length mostly reflects recall of training schemas, not effort spent honoring constraints Does longer reasoning actually mean harder problems?. So if you came expecting that longer chains buy you more careful attention to the constraints, the more useful takeaway is the opposite: the constraint-carrying tokens are sparse and fragile, and length tends to bury them rather than amplify them.
Sources 8 notes
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.
Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.