Why do reasoning tasks improve more than retrieval from lookup memory?
This note explores why training and elicitation methods boost reasoning ability far more than they boost factual lookup. The corpus points to a difference in how the two are stored in the first place.
This reads the question as: why does the same model improve more at reasoning than at recalling stored facts? The sharpest answer in the collection is that the two draw on fundamentally different kinds of knowledge. An analysis of five million pretraining documents found that reasoning leans on *procedural* knowledge, meaning general strategies and methods that show up across many unrelated sources, while factual recall depends on narrow, document-specific memorization of the exact target fact (see "Does procedural knowledge drive reasoning more than factual retrieval?"). Procedure is reinforced everywhere; a specific fact lives in essentially one place. That asymmetry means reasoning has far more redundant, transferable signal to draw on and improve from, while lookup is a brittle, single-source retrieval that training can't make denser.
The second reason is that reasoning is often already latent in the model and just needs to be *unlocked*, whereas a missing fact is simply absent. Several independent methods (RL steering, critique fine-tuning, decoding tweaks, feature steering) all surface reasoning that base models already possess, suggesting post-training *selects* reasoning rather than creating it (see "Do base models already contain hidden reasoning ability?"). You see the same thing with modular cognitive tools that lift GPT-4.1 on AIME 2024 from 26.7% to 43.3% without any RL training, purely by structuring the elicitation (see "Can modular cognitive tools unlock reasoning without training?"). There's no equivalent move for a fact you never stored: you can't elicit what isn't there, you can only retrieve it.
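To put the reported AIME 2024 jump in perspective, the sketch below separates the absolute gain (percentage points) from the relative gain. Only the two percentages come from the sources; the helper name `gain` is hypothetical and exists purely for illustration.

```python
def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain as a fraction of `before`)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


if __name__ == "__main__":
    # 26.7% and 43.3% are the figures reported for GPT-4.1 on AIME 2024.
    points, fraction = gain(26.7, 43.3)
    print(f"absolute: {points:.1f} points, relative: {fraction:.0%}")
```

Read this way, the elicitation-only change is worth about 16.6 points, which is roughly a 62% relative improvement over the baseline.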
The most counterintuitive thread: reasoning gains don't even require correct *content* to teach. Models trained on systematically irrelevant reasoning traces maintain solution accuracy, and sometimes generalize better out of distribution; the traces act as computational scaffolding, training the *process*, not transmitting facts (see "Do reasoning traces need to be semantically correct?"). This helps explain why reasoning is more improvable: you're shaping a reusable procedure, not memorizing a payload. Where memorization does leak into reasoning, it tends to *cause* errors rather than help. Local token-level memorization accounts for up to 67% of chain-of-thought mistakes, especially as problems get harder (see "Where do memorization errors arise in chain-of-thought reasoning?").
The corpus also reframes the retrieval side itself: lookup improves most not by cramming more facts but by *matching structure to the task*. StructRAG shows that routing a query to the right knowledge representation (a table, graph, algorithm, or catalogue rather than uniform chunks) outperforms standard retrieval on knowledge-intensive reasoning questions, an approach grounded in cognitive-fit theory (see "Can routing queries to task-matched structures improve RAG reasoning?"). In other words, the gains attributed to "retrieval" are often really gains in how reasoning organizes what it retrieves.
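The routing idea can be sketched in a few lines. This is a toy stand-in under stated assumptions, not StructRAG's method: the source describes a DPO-trained router, whereas the sketch below swaps in a plain keyword heuristic. The five structure types come from the source summary; the cue words, the `CUES` table, and the `route` function are all hypothetical.

```python
from enum import Enum


class Structure(Enum):
    TABLE = "table"
    GRAPH = "graph"
    ALGORITHM = "algorithm"
    CATALOGUE = "catalogue"
    CHUNK = "chunk"


# Hypothetical cue words; a real router would be a trained model, not a lookup.
CUES = {
    Structure.TABLE: ("compare", "versus", "difference between"),
    Structure.GRAPH: ("connected", "path", "related to", "chain"),
    Structure.ALGORITHM: ("steps", "procedure", "plan", "how to"),
    Structure.CATALOGUE: ("summarize", "overview", "list all"),
}


def route(query: str) -> Structure:
    """Pick the first structure whose cue appears in the query, else plain chunks."""
    q = query.lower()
    for structure, cues in CUES.items():
        if any(cue in q for cue in cues):
            return structure
    return Structure.CHUNK


if __name__ == "__main__":
    for question in ("Compare revenue of A and B in 2021",
                     "How to deploy the service",
                     "Who founded it?"):
        print(question, "->", route(question).value)
```

The point of the sketch is the shape of the decision: the query's demands, not the corpus, determine which representation the retrieved material is organized into before reasoning starts.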
One caveat worth carrying away: "reasoning improves more" doesn't mean it's robust. Chain-of-thought degrades predictably once you leave the training distribution, producing fluent but logically broken output (see "Does chain-of-thought reasoning actually generalize beyond training data?"), and reasoning accuracy falls from 92% to 68% with only 3000 tokens of padding, far below the context limit (see "Does reasoning ability actually degrade with longer inputs?"). So the better framing may be: reasoning is more *trainable* because it's a transferable procedure widely seeded in pretraining, but transferable isn't the same as reliable.
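The length-degradation finding comes from holding a reasoning task fixed and growing the input with irrelevant text. A minimal sketch of that kind of probe follows. It is not the benchmark's own code: the function name `pad_to_length` is hypothetical, and whitespace-separated words are used as a rough proxy for tokens.

```python
import random


def pad_to_length(key_sentences, filler_sentences, target_words, seed=0):
    """Embed key sentences, in their original order, inside irrelevant filler.

    Filler is added until the text reaches at least `target_words` words.
    `filler_sentences` must be non-empty.
    """
    rng = random.Random(seed)
    count = sum(len(s.split()) for s in key_sentences)
    padding = []
    i = 0
    while count < target_words:
        sentence = filler_sentences[i % len(filler_sentences)]
        padding.append(sentence)
        count += len(sentence.split())
        i += 1

    # Choose one insertion slot per key sentence; sorting keeps their order.
    slots = sorted(rng.randint(0, len(padding)) for _ in key_sentences)

    out, k = [], 0
    for idx in range(len(padding) + 1):
        while k < len(slots) and slots[k] == idx:
            out.append(key_sentences[k])
            k += 1
        if idx < len(padding):
            out.append(padding[idx])
    return " ".join(out)


if __name__ == "__main__":
    keys = ["Alice is taller than Bob.", "Bob is taller than Carol."]
    filler = ["The weather was mild that day.", "Trains ran on schedule."]
    for target in (0, 500, 3000):
        text = pad_to_length(keys, filler, target)
        print(target, len(text.split()))
```

Feeding the same question at each length to a model and comparing accuracy isolates the effect of input length from the difficulty of the reasoning itself.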
Sources (8 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.