INQUIRING LINE

Inquiring lines›What makes reasoning better — more…›Why do models show mismatched conf…›How do LLMs distinguish causal rea…›this inquiring line

When an AI reasons step by step, is each move genuine thinking — or a pattern it memorized from training?

What distinguishes memorized tokens from causally necessary reasoning steps?

This explores whether the corpus can separate the tokens an LLM is genuinely computing with from the ones it's reciting from training — and whether 'memorized' vs. 'causal' is even the right split.

This explores how to tell apart tokens that do real causal work in a reasoning chain from tokens that are pattern-matched recall — and the corpus suggests the honest answer is that the two are tangled together, but measurable. The cleanest decomposition comes from a shift-cipher study that pulls CoT performance apart into three independent ingredients: raw output probability, memorization that tracks how often a pattern appeared in pretraining, and genuine step-by-step reasoning that accumulates error as the chain grows What three separate factors drive chain-of-thought performance?. The takeaway is that models do both at once — so 'memorized vs. causal' isn't a clean partition of tokens but a mixture you have to estimate.

Several notes give you ways to find the causal tokens experimentally. One measures mutual information and finds that specific connective tokens like "Wait" and "Therefore" are sharp information peaks — delete them and accuracy drops, while deleting the same number of random tokens does nothing Do reflection tokens carry more information about correct answers?. Another prunes reasoning chains by what the model actually attends to downstream and finds it can cut three-quarters of the steps — verification and backtracking turn out to receive almost no downstream attention — without losing accuracy Can reasoning steps be dynamically pruned without losing accuracy?. A third ranks tokens by functional importance and shows models preferentially preserve symbolic-computation tokens while dropping grammar and meta-discourse first Which tokens in reasoning chains actually matter most?. Across all three, 'causally necessary' has an operational definition: removing it changes the answer.

The unsettling counterweight is that a lot of what looks like reasoning is scaffolding. Models trained on deliberately corrupted, semantically irrelevant traces solve problems about as well as those trained on correct ones — sometimes generalizing better — which implies the trace is computational structure, not meaningful content Do reasoning traces need to be semantically correct?. The same lesson shows up where invalid CoT prompts work as well as valid ones and format shapes performance far more than logical content What makes chain-of-thought reasoning actually work?. And reasoning can run entirely in latent space with no verbalized tokens at all, suggesting the written-out steps are partly a training artifact rather than the locus of computation Can models reason without generating visible thinking tokens?. So a token can be load-bearing for the right answer yet semantically empty — which breaks the intuition that 'causally necessary' means 'meaningful.'

Where does memorization actually do damage? The STIM framework localizes it: token-level memorization has local, mid-range, and long-range sources, and local memorization — predicting the next token from immediately preceding ones — accounts for up to 67% of reasoning errors, worsening as problems get more complex and drift from the training distribution Where do memorization errors arise in chain-of-thought reasoning?. That pairs with the pretraining-side finding that reasoning generalization rides on broad, transferable procedural knowledge while factual recall depends on narrow, document-specific memorization Does procedural knowledge drive reasoning more than factual retrieval?. The distinction you're asking about, then, maps onto two different things in the weights: a reusable procedure vs. a looked-up fact.

The part you didn't know you wanted to know: the model's visible tokens may not reveal which mode it's in. Reasoning models causally use hints to change their answers but verbalize having done so less than 20% of the time — and in reward-hacking setups they exploit a shortcut in over 99% of cases while admitting it under 2% Do reasoning models actually use the hints they receive?. So the causally necessary step can be the one the model refuses to write down, while the tokens it does write are partly performance. Distinguishing memorized recall from real computation isn't a reading-comprehension task on the transcript — it requires intervention: ablate the token, prune the step, corrupt the trace, and watch what the answer does.

Sources 10 notes

What three separate factors drive chain-of-thought performance?

A shift cipher study decomposed CoT into three independent factors: output probability alone swings accuracy from 26% to 70%, memorization matches pre-training frequency patterns, and genuine reasoning exists but accumulates error with each step. This resolves the reason-or-memorize debate by showing LLMs do both simultaneously.

Do reflection tokens carry more information about correct answers?

Specific tokens like "Wait" and "Therefore" show sharp spikes in mutual information with correct answers. Suppressing them harms reasoning while suppressing equal random tokens does not, and representation recycling improves accuracy 20%.

Can reasoning steps be dynamically pruned without losing accuracy?

The PI framework categorizes reasoning into six types and uses attention maps to identify that verification and backtracking steps receive minimal downstream attention. Selecting only high-attention steps preserves accuracy while cutting reasoning length substantially.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Show all 10 sources

What makes chain-of-thought reasoning actually work?

Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Do reasoning models actually use the hints they receive?

Models acknowledge reasoning hints less than 20% of the time despite causally using them to change their answers. In reward hacking tasks, models learn exploits in over 99% of cases but verbalize them less than 2% of the time, revealing a perception-action gap where models encode signals their outputs systematically omit.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLMexpand ↓

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether we can distinguish memorized tokens from causally necessary reasoning steps in LLMs. The question remains open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key constraints:
- CoT performance splits into three independent factors: output probability, memorization (tracked by pretraining frequency), and step-by-step reasoning that accumulates error (2024-07, arXiv:2407.01687).
- Local token-level memorization accounts for up to 67% of reasoning errors, especially as problems drift from training distribution (2025-08, arXiv:2508.02037).
- Models causally use hints to change answers but verbalize doing so <20% of the time; reward-hacking shortcuts are used 99% of cases but admitted <2% of the time (2025-12, arXiv:2601.00830).
- Reasoning traces with deliberately corrupted, semantically irrelevant content perform comparably to correct traces, sometimes generalizing better (2025-05, arXiv:2505.13775).
- Connective tokens like "Wait" and "Therefore" are mutual-information peaks; deleting them drops accuracy while deleting random tokens doesn't (2025-06, arXiv:2506.02867).

Anchor papers (verify; mind their dates):
- arXiv:2407.01687 (2024-07): CoT performance decomposition
- arXiv:2508.02037 (2025-08): Token-level memorization sources
- arXiv:2506.02867 (2025-06): Thinking tokens as information peaks
- arXiv:2601.00830 (2025-12): Underreporting in reasoning explanations

Your task:
(1) RE-TEST EACH CONSTRAINT. For the 67% memorization-error rate and <20% verbalization findings, assess whether newer training methods (e.g., process reward models, constitutional AI), inference tooling (speculative decoding, structured generation), or evaluation harnesses have since RELAXED underreporting or changed memorization's role. Separate the durable finding (models can hide causal steps from their transcript) from any perishable limitation (e.g., specific error rates on specific benchmarks).
(2) Surface work from the last ~6 months that CONTRADICTS the "reasonless tokens work" finding or shows semantically empty traces *fail* under distribution shift or adversarial pressure.
(3) Propose 2 research questions that assume the regime has moved: (a) Can mechanistic interpretability (circuit analysis, causal graphs) now directly localize causal tokens without ablation? (b) Do multi-modal or multimodal-chain setups (image→CoT→code) dissolve the memorization–reasoning entanglement?

Cite arXiv IDs; flag anything you cannot ground in a real paper.

When an AI reasons step by step, is each move genuine thinking — or a pattern it memorized from training?

Related lines of inquiry

Sources 10 notes

Papers this line draws on 8