Why does reasoning transfer across different numbers but factual recall does not?
This explores why the ability to *reason through* a problem carries over to new inputs (different numbers, unseen problems) while the ability to *recall a specific fact* doesn't — and what that says about how these two capabilities are built differently inside a model.
The corpus points to a clean answer: the two capabilities are built from different materials and live in different places. The most direct evidence comes from an analysis of five million pretraining documents (see "Does procedural knowledge drive reasoning more than factual retrieval?"), which found that when a model reasons, it draws on *procedural* knowledge: the how-to patterns (how to do arithmetic, how to apply a formula) scattered broadly across many unrelated documents. Because that procedure is the same whether the numbers are 7 or 700, it transfers. Factual recall is the opposite: the answer to 'when was X born' lives in a few specific documents, so the model has to have memorized that narrow target. Procedure is reusable; a fact is a lookup that either happened or didn't.
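The procedure-versus-lookup contrast can be made concrete with a deliberately small sketch. It is an analogy only, not a model of what happens inside a network, and every name in it (`add_procedure`, `BIRTH_YEARS`, `recall_birth_year`) is hypothetical and chosen for illustration.

```python
from typing import Optional

# Hypothetical toy, not taken from the cited analysis: it only illustrates
# why a procedure transfers to new inputs while a lookup does not.


def add_procedure(a: int, b: int) -> int:
    """A procedure: the same steps apply to any pair of integers."""
    return a + b


# A fact store: an answer exists only for keys that were "seen" beforehand.
BIRTH_YEARS = {"Ada Lovelace": 1815, "Alan Turing": 1912}


def recall_birth_year(name: str) -> Optional[int]:
    """A lookup: returns the stored year, or None if the fact was never stored."""
    return BIRTH_YEARS.get(name)


if __name__ == "__main__":
    print(add_procedure(7, 5))               # small numbers
    print(add_procedure(700, 500))           # same procedure, larger numbers
    print(recall_birth_year("Alan Turing"))  # stored fact
    print(recall_birth_year("Grace Hopper")) # never stored, so no answer
```

The addition function never needs to have seen 700 and 500 together, whereas the lookup has nothing to return for a name that was never stored.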
This split shows up architecturally too. One study of how LLMs process a query found that knowledge retrieval happens in the *lower* layers of the network while reasoning adjustments happen in the *higher* layers (see "Why does reasoning training help math but hurt medical tasks?"). That separation is why training a model harder on reasoning tends to sharpen math but can quietly *degrade* knowledge-heavy domains like medicine: you're tuning the higher-layer machinery while the lower-layer fact store goes untended. Reasoning and recall aren't just different skills; on this account they sit in different parts of the network and can be improved or harmed independently.
There's a catch worth knowing, though: reasoning transfer is real but it has limits, and those limits look a lot like the limits of recall. When you push chain-of-thought reasoning outside the distribution it was trained on (new tasks, new formats, unfamiliar lengths), it degrades predictably and starts producing fluent nonsense that imitates the *shape* of reasoning without valid logic underneath (see "Does chain-of-thought reasoning actually generalize beyond training data?"). And a token-level autopsy of reasoning errors found that 'local memorization', where the model leans on immediately preceding tokens rather than actually computing, accounts for up to two-thirds of reasoning mistakes, especially as problems get harder (see "Where do memorization errors arise in chain-of-thought reasoning?"). So reasoning transfers *until* it secretly collapses back into recall.
The strangest corner of this: if reasoning is genuinely procedural, the procedure may matter more than the content. Models trained on deliberately *corrupted* reasoning traces, with steps that are logically irrelevant, perform about as well as those trained on correct ones, and sometimes generalize better (see "Do reasoning traces need to be semantically correct?"). The traces seem to act as computational scaffolding that gives the model room to work, not as meaningful logic. That fits the same picture: what transfers across different numbers isn't a memorized answer but a *process* the model runs, which is also why a single training example can be enough to switch that process on (see "Can a single training example unlock mathematical reasoning?"). You don't teach a model to count to a new number; you activate a procedure it can already run on any number.
Sources (6 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.