Where do memorization errors arise in chain-of-thought reasoning?
Explores whether memorization in language model reasoning can be localized to specific token sources and which sources dominate error patterns during long generations.
STIM (Source-aware Token-level Identification of Memorization) argues that memorization in long CoT generations must be identified at the token level, not the sequence level. A single faulty token — produced by memorization rather than reasoning — can trigger cascading errors through subsequent steps. Existing metrics report a single score for the entire sequence, missing where and why individual tokens go wrong.
Three distinct memorization sources influence each token:
Local memorization — frequent continuations of immediately preceding tokens. The model generates the next token based on statistical co-occurrence with its local context, not reasoning. This is the dominant error source, responsible for up to 67% of wrong tokens.
Mid-range memorization — tokens that frequently co-occur with the generation prefix. The model has seen this pattern in pretraining and reproduces it, even when the current reasoning context requires a different continuation.
Long-range memorization — frequent co-occurrence with tokens in the input prompt. The prompt triggers a familiar pattern from pretraining that overrides the reasoning chain.
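The three sources above differ only in which context a token is compared against, which a toy co-occurrence scorer can illustrate. This is a minimal sketch, not STIM's actual scoring procedure: the names `build_counts`, `cooccurrence_score`, `source_scores`, and `local_window`, the count-ratio score, and the two-sequence toy corpus are all hypothetical stand-ins for pretraining statistics.

```python
from collections import Counter

def build_counts(corpus):
    """Count ordered co-occurrences (earlier token, later token) in a toy corpus."""
    pairs, singles = Counter(), Counter()
    for seq in corpus:
        for i, earlier in enumerate(seq):
            singles[earlier] += 1
            for later in seq[i + 1:]:
                pairs[(earlier, later)] += 1
    return pairs, singles

def cooccurrence_score(target, context, pairs, singles):
    """Mean of count(c, target) / count(c) over context tokens (assumed score, not STIM's)."""
    if not context:
        return 0.0
    ratios = [pairs[(c, target)] / singles[c] if singles[c] else 0.0 for c in context]
    return sum(ratios) / len(ratios)

def source_scores(prompt, generation, position, pairs, singles, local_window=2):
    """Score one generated token against the three context types named in the note."""
    target = generation[position]
    prefix = generation[:position]   # tokens generated so far
    history = prompt + prefix        # everything the model has seen
    scores = {
        # local: only the immediately preceding tokens
        "local": cooccurrence_score(target, history[-local_window:], pairs, singles),
        # mid-range: the generation prefix
        "mid_range": cooccurrence_score(target, prefix, pairs, singles),
        # long-range: the input prompt
        "long_range": cooccurrence_score(target, prompt, pairs, singles),
    }
    dominant = max(scores, key=scores.get)
    return scores, dominant

# Hypothetical toy data: the "pretraining corpus" is two short sequences.
corpus = [
    ["two", "plus", "two", "equals", "four"],
    ["two", "plus", "three", "equals", "five"],
]
pairs, singles = build_counts(corpus)
prompt = ["what", "is", "two", "plus", "three"]
generation = ["equals", "four"]      # "four" is the wrong token here
scores, dominant = source_scores(prompt, generation, 1, pairs, singles)
print(scores, dominant)
```

Scoring every position this way yields a per-token profile rather than one number per sequence, which is the property the note attributes to token-level identification.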
Key distributional findings:
- Complexity increases memorization. As reasoning becomes harder, models rely more on memorization and fall back on familiar patterns.
- Distributional shift increases memorization. Moving toward rare or atypical inputs strengthens memorization signals. The model has less training experience to draw on, so it relies more on pattern-matching from similar-but-not-identical training examples.
- Base vs long-tail reversal. In base settings, memorization often supports correct answers (familiar patterns lead to right conclusions). In long-tail scenarios, the same memorization mechanisms drive errors — defective recall when faced with unfamiliar contexts.
This connects to the broader reasoning trace reliability cluster. The related note "Which sentences actually steer a reasoning trace?" asks which sentences in a trace have outsized influence on the final answer. STIM adds a complementary mechanism: specific tokens at the sub-sentence level carry memorization-driven influence that can derail even well-structured reasoning chains. The failure is more granular than the thought level: it operates at individual tokens.
The practical implication: high memorization scores are strong indicators of reasoning failures (measured via Precision@k and Recall@k). This offers a potential diagnostic tool for identifying where reasoning chains are unreliable, independent of whether the final answer is correct. The diagnostic bears directly on the faithfulness problem raised in the related note "Do language models actually use their reasoning steps?": STIM's memorization scores provide a token-level mechanism for faithfulness failure. Memorized tokens are causally unnecessary (the answer was determined by pattern-matching, not reasoning) and causally insufficient (the memorized continuation may diverge from the reasoning the chain appears to perform).
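A minimal sketch of how Precision@k and Recall@k could be computed for this diagnostic, assuming tokens are ranked by memorization score and the top k are treated as predicted failures. The note only names the metrics, so the ranking setup, the function name `precision_recall_at_k`, and the example scores and labels are assumptions made for illustration.

```python
def precision_recall_at_k(memorization_scores, is_wrong, k):
    """Rank tokens by memorization score (highest first) and check the top k against wrong tokens."""
    n = len(memorization_scores)
    if n != len(is_wrong):
        raise ValueError("scores and labels must have the same length")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of tokens")
    ranked = sorted(range(n), key=lambda i: memorization_scores[i], reverse=True)
    hits = sum(1 for i in ranked[:k] if is_wrong[i])
    total_wrong = sum(1 for w in is_wrong if w)
    precision = hits / k                                   # share of flagged tokens that are wrong
    recall = hits / total_wrong if total_wrong else 0.0    # share of wrong tokens that were flagged
    return precision, recall

# Hypothetical per-token scores and correctness labels for one reasoning chain.
example_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
example_wrong = [False, True, False, False, True]
p_at_2, r_at_2 = precision_recall_at_k(example_scores, example_wrong, k=2)
print(p_at_2, r_at_2)
```

Because the labels are per token and not per answer, the same computation applies to chains whose final answer happens to be correct.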
Related concepts in this collection 5
- Which sentences actually steer a reasoning trace?
  Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
  complementary granularity: thought anchors operate at sentence level, STIM at token level
- Do high-entropy tokens drive reasoning model improvements?
  Explores whether only a small fraction of tokens (those with high entropy at decision points) actually matter for improving reasoning performance in language models, and whether training on them alone could work as well as full training.
  both identify sparse tokens with disproportionate influence; STIM adds the memorization-source dimension
- Do reasoning traces need to be semantically correct?
  Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
  corrupted traces may work BECAUSE they break local memorization patterns, forcing the model into generalization mode
- Does chain-of-thought reasoning reveal genuine inference or pattern matching?
  Explores whether CoT instructions unlock real reasoning capabilities or simply constrain models to mimic familiar reasoning patterns from training data. This matters for understanding whether language models can actually reason abstractly.
  local memorization provides the mechanism: the model reproduces familiar reasoning patterns rather than deriving new ones
- Do language models actually use their reasoning steps?
  Chain-of-thought reasoning looks valid on the surface, but does each step genuinely influence the model's final answer, or are the reasoning chains decorative? This matters for trusting AI explanations.
  STIM provides the token-level mechanism: memorized tokens are neither causally sufficient nor necessary for reasoning
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
- Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
- Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling
- When More is Less: Understanding Chain-of-Thought Length in LLMs
- How do Transformers Learn Implicit Reasoning?
- Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models -- A Survey
- Reasoning Beyond Chain-of-Thought: A Latent Computational Mode in Large Language Models
Original note title
token-level memorization in CoT reasoning has three distinct sources and local memorization causes up to 67 percent of reasoning errors