INQUIRING LINE


AI can reason through problems it's never seen before — but ask it a specific fact and it either memorized it or didn't.

Why does reasoning transfer across different numbers but factual recall does not?

This explores why the ability to *reason through* a problem carries over to new inputs (different numbers, unseen problems) while the ability to *recall a specific fact* doesn't — and what that says about how these two capabilities are built differently inside a model.

The corpus points to a clean answer: the two capabilities are built from different materials and live in different places. The most direct evidence comes from an analysis of five million pretraining documents (see: Does procedural knowledge drive reasoning more than factual retrieval?), which found that when a model reasons, it draws on *procedural* knowledge: the how-to patterns (how to do arithmetic, how to apply a formula) scattered broadly across many unrelated documents. Because that procedure is the same whether the numbers are 7 or 700, it transfers. Factual recall is the opposite: the answer to 'when was X born' lives in a few specific documents, so the model has to have memorized that narrow target. Procedure is reusable; a fact is a lookup that either happened or didn't.
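The procedure-versus-lookup contrast can be made concrete with a toy analogy. This is only a sketch: it does not model how a language model stores anything internally, and the names `add_by_columns`, `FACTS`, and `recall`, along with the placeholder table entry, are hypothetical.

```python
# Toy analogy only: a reusable procedure versus a lookup table.

def add_by_columns(a: int, b: int) -> int:
    """Schoolbook addition for non-negative integers:
    right to left, one column at a time, carrying as needed."""
    xs, ys = str(a)[::-1], str(b)[::-1]
    digits, carry = [], 0
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        carry, digit = divmod(x + y + carry, 10)
        digits.append(str(digit))
    if carry:
        digits.append(str(carry))
    return int("".join(reversed(digits)))

# Hypothetical fact store; the entry is a placeholder, not real data.
FACTS = {"birth_year:person_a": 1900}

def recall(key: str):
    """A lookup returns the stored value, or None if it was never stored."""
    return FACTS.get(key)

# The same steps handle small and large operands alike.
print(add_by_columns(7, 5), add_by_columns(700, 500))
# A fact that was never stored cannot be derived from the ones that were.
print(recall("birth_year:person_a"), recall("birth_year:person_b"))
```

The column-by-column steps never change with the size of the operands, which is the sense in which a procedure transfers; nothing in the table helps with a key that was never stored.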

This split shows up architecturally too. One study of how LLMs process a query found that knowledge retrieval happens in the *lower* layers of the network while reasoning adjustments happen in the *higher* layers (see: Why does reasoning training help math but hurt medical tasks?). That separation is why training a model harder on reasoning tends to sharpen math but can quietly *degrade* knowledge-heavy domains like medicine: you're tuning the higher-layer machinery while the lower-layer fact store goes untended. Reasoning and recall aren't just different skills; they're separable subsystems, sitting at different depths of the network, that can be improved or harmed independently.

There's a catch worth knowing, though: reasoning transfer is real but it has limits, and those limits look a lot like the limits of recall. When you push chain-of-thought reasoning outside the distribution it was trained on (new tasks, new formats, unfamiliar lengths), it degrades predictably and starts producing fluent nonsense that imitates the *shape* of reasoning without valid logic underneath (see: Does chain-of-thought reasoning actually generalize beyond training data?). And a token-level autopsy of reasoning errors found that 'local memorization', where the model leans on immediately preceding tokens rather than actually computing, causes up to two-thirds of reasoning mistakes, especially as problems get harder (see: Where do memorization errors arise in chain-of-thought reasoning?). So reasoning transfers *until* it secretly collapses back into recall.
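One way to picture "transfers until it collapses back into recall" is to score a solver separately on inputs inside and outside the range it was fitted on. The sketch below is a deliberately crude toy, not a reproduction of any cited experiment: `memorizing_solver` is a hypothetical stand-in that answers from a memorized table and, for unseen operands, falls back to the nearest operands it has seen.

```python
from itertools import product

# Assumption for illustration: the "training distribution" is operands 0..9.
TRAIN_RANGE = range(0, 10)
MEMORIZED = {(a, b): a + b for a, b in product(TRAIN_RANGE, repeat=2)}

def memorizing_solver(a: int, b: int) -> int:
    """Answer from the table; unseen operands are clamped to the nearest
    seen ones, so the output looks plausible but is not computed."""
    lo, hi = TRAIN_RANGE[0], TRAIN_RANGE[-1]
    clamp = lambda v: min(max(v, lo), hi)
    return MEMORIZED[(clamp(a), clamp(b))]

def accuracy(solver, pairs) -> float:
    """Fraction of operand pairs for which the solver returns the true sum."""
    pairs = list(pairs)
    return sum(solver(a, b) == a + b for a, b in pairs) / len(pairs)

in_dist = accuracy(memorizing_solver, product(range(0, 10), repeat=2))
shifted = accuracy(memorizing_solver, product(range(10, 20), repeat=2))
print(f"in-distribution: {in_dist:.2f}, shifted: {shifted:.2f}")
```

Reporting the two buckets separately is the point of the sketch: a single pooled accuracy would hide the fact that the solver only looks competent where it can fall back on what it has already seen.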

The strangest corner of this: if reasoning is genuinely procedural, the procedure may matter more than the content. Models trained on deliberately *corrupted* reasoning traces, with steps that are logically irrelevant, perform about as well as those trained on correct ones, and sometimes generalize better (see: Do reasoning traces need to be semantically correct?). The traces seem to act as computational scaffolding that gives the model room to work, not as meaningful logic. That fits the same picture: what transfers across different numbers isn't a memorized answer but a *process* the model runs, which is also why a single training example can be enough to switch that process on (see: Can a single training example unlock mathematical reasoning?). You don't teach a model to count to a new number; you activate a procedure it can already run on any number.

Sources (6 notes)

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

Why does reasoning training help math but hurt medical tasks?

Two-phase inference model shows knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.


Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Papers this line draws on (8)

The research behind the notes this line reads — ranked by how closely each paper relates.

Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (2.58 match)
When More is Less: Understanding Chain-of-Thought Length in LLMs (1.73 match)
Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity (1.71 match)
LLMs can implicitly learn from mistakes in-context (1.70 match)
Eliciting Reasoning in Language Models with Cognitive Tools (1.66 match)
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time (0.93 match)
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens (0.89 match)
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models (0.88 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an LLM researcher auditing claims about reasoning transfer vs. factual recall brittleness. The question remains open: why do reasoning capabilities generalize across different numbers while factual recall does not?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025 and include:
- Procedural knowledge (scattered across many pretraining documents) drives reasoning generalization; factual knowledge lives in few narrow documents and requires memorization (~2025).
- Knowledge retrieval happens in lower network layers; reasoning adjustments happen in higher layers; training reasoning harder can degrade knowledge-heavy domains like medicine (~2025).
- Chain-of-thought reasoning degrades predictably outside training distribution; local memorization accounts for up to two-thirds of reasoning errors, especially as difficulty increases (~2025).
- Models trained on deliberately corrupted reasoning traces perform comparably to correct ones and sometimes generalize better; traces act as computational scaffolding, not meaningful logic (~2025).
- A single training example can activate mathematical reasoning across new numbers via reinforcement learning (~2025).

Anchor papers (verify; mind their dates):
- arXiv:2411.12580 (Nov 2024) – Procedural knowledge in pretraining
- arXiv:2507.18178 (Jul 2025) – Decoupling knowledge and reasoning via dual-system cognition
- arXiv:2508.02037 (Aug 2025) – Token-level memorization in chain-of-thought
- arXiv:2504.20571 (Apr 2025) – One-shot reasoning activation via RL

Your task:
(1) RE-TEST EACH CONSTRAINT. For the claim that reasoning transfers but recall does not, determine whether newer models (o1, Gemini 2.5, Claude Opus), larger scales, or improved instruction-tuning have begun to dissolve this boundary. Specifically: does factual recall now transfer across numbers (or contexts) when fine-tuned differently? Does reasoning still fail predictably out-of-distribution, or have test-time scaling / verifiers / retrieval integration relaxed this limit? Separate the durable question (procedural vs. memorized knowledge likely differ fundamentally) from the perishable claims (that they are architecturally separated in lower vs. higher layers, or that this separation is stable under scaling).
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months: any paper showing procedural/factual decoupling is overstated, or that reasoning and recall are more entangled than the library suggests.
(3) Propose 2 research questions that ASSUME the regime may have shifted: e.g., Can reasoning transfer be *taught* to factual domains via new architectures or training protocols? If reasoning is scaffolding, not logic, what determines which scaffold transfers?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
