How does o1-style reasoning relate to learned search processes versus memorized solutions?
This explores whether o1-style reasoning is genuinely learning to *search* through a problem (exploring, backtracking, recovering) or whether it's leaning on memorized patterns dressed up as reasoning. The corpus draws a surprisingly clean line between the two, and the line starts in pretraining: an analysis of five million pretraining documents found that reasoning draws on broad, transferable *procedural* knowledge (the same few documents about how to do a kind of operation show up across many problems), while factual recall depends on narrow, document-specific *memorization* of the exact answer (see 'Does procedural knowledge drive reasoning more than factual retrieval?'). So even before any o1-style training, 'reasoning' and 'memorizing' are mechanically different things.
Where this matters most is when memorization sneaks into something that *looks* like reasoning. One framework dissecting chain-of-thought traces found that memorization isn't all-or-nothing: it has local, mid-range, and long-range sources. Shallow *local* memorization (predicting the next step from the immediately preceding tokens rather than actually working the problem) accounts for up to two-thirds of reasoning errors, especially as problems get harder and drift from the training distribution (see 'Where do memorization errors arise in chain-of-thought reasoning?'). In other words, o1-style reasoning often fails by falling back on pattern-completion exactly when real search is needed.
The case that search is *learnable*, not just memorized, comes from training models on messy exploration instead of clean answers. 'Stream of Search' serializes the whole process, mistakes and backtracking included, and models trained this way reach 25% higher accuracy than those trained only on optimal trajectories; they appear to build an internal world model for search and discover adaptive strategies rather than replaying a fixed procedure (see 'Does training on messy search processes improve reasoning?'). Related work plants this even earlier, treating chain-of-thought as an exploratory action rewarded by information gain during pretraining itself (see 'Can chain-of-thought reasoning be learned during pretraining itself?'). The lesson: you get search behavior by training on search, not on solutions.
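To make "training on search, not on solutions" concrete, here is a minimal sketch of the difference between a serialized search trace and a clean optimal trajectory. The toy subset-sum task, the trace wording, and the function names `search_with_trace` and `optimal_trajectory` are all assumptions made for illustration; they are not the task or format used in the cited work.

```python
from typing import List, Optional, Tuple


def search_with_trace(numbers: List[int], target: int) -> Tuple[Optional[List[int]], str]:
    """Depth-first subset-sum search that logs every step, dead ends included.

    Assumes all numbers are positive, so overshooting the target is a dead end.
    """
    trace: List[str] = [f"goal: sum to {target} using {numbers}"]

    def dfs(start: int, chosen: List[int], total: int) -> Optional[List[int]]:
        if total == target:
            trace.append(f"reached {total} with {chosen}: success")
            return list(chosen)
        for i in range(start, len(numbers)):
            n = numbers[i]
            if total + n > target:
                # Record the failed attempt instead of hiding it.
                trace.append(f"try {chosen + [n]} -> {total + n}: overshoot, backtrack")
                continue
            trace.append(f"try {chosen + [n]} -> {total + n}")
            found = dfs(i + 1, chosen + [n], total + n)
            if found is not None:
                return found
            trace.append(f"no solution below {chosen + [n]}, backtrack")
        return None

    solution = dfs(0, [], 0)
    return solution, "\n".join(trace)


def optimal_trajectory(solution: List[int]) -> str:
    """The 'clean answer' view: only the steps on the successful path."""
    lines: List[str] = []
    chosen: List[int] = []
    total = 0
    for n in solution:
        chosen.append(n)
        total += n
        lines.append(f"try {chosen} -> {total}")
    return "\n".join(lines)


if __name__ == "__main__":
    solution, trace = search_with_trace([8, 5, 4, 2], 11)
    print(trace)                         # full exploration, mistakes included
    if solution is not None:
        print(optimal_trajectory(solution))  # only the winning path
```

A model trained on the first string sees what a dead end looks like and how to recover from it; a model trained only on the second never does. That contrast is the whole point of the sketch, not the particular search algorithm.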
But 'learned search' turns out to be a generous description of what current o1-style models actually do. Several notes converge on the finding that these models explore *unsystematically*: they wander like tourists rather than searching like scientists. Their exploration lacks validity, effectiveness, and necessity, which makes success probability collapse exponentially as problems deepen (see 'Why do reasoning LLMs fail at deeper problem solving?'). As a rough illustration, if each step independently succeeded with probability p, a solution needing d steps would succeed with probability p^d. A reinforcing failure is 'underthinking': abandoning promising paths mid-exploration. Strikingly, simply penalizing thought-switching at decode time improves accuracy with no retraining at all (see 'Do reasoning models switch between ideas too frequently?' and 'Why do reasoning models abandon promising solution paths?'), which means the viable solution paths were *already there* and just got dropped. Structuring the breadth of exploration through learned abstractions, rather than going deeper on one chain, also outperforms parallel solution sampling at large compute budgets (see 'Can abstractions guide exploration better than depth alone?').
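The decode-time fix described above can be sketched in a few lines. This is a minimal illustration, assuming a greedy decoder over precomputed per-step scores in place of a real model. The `penalty` and `window` parameters, the token strings, and both function names are hypothetical and not taken from the cited notes, which only state that thought-transition tokens are penalized during decoding.

```python
from typing import Dict, List, Set


def penalize_thought_switch(
    logits: Dict[str, float],
    switch_tokens: Set[str],
    tokens_since_thought_start: int,
    penalty: float = 3.0,
    window: int = 4,
) -> Dict[str, float]:
    """Lower the scores of switch tokens while the current thought is still young."""
    if tokens_since_thought_start >= window:
        return dict(logits)  # the thought has had room to develop; no penalty
    return {
        tok: (score - penalty) if tok in switch_tokens else score
        for tok, score in logits.items()
    }


def greedy_decode(
    steps: List[Dict[str, float]],
    switch_tokens: Set[str],
    penalty: float = 3.0,
    window: int = 4,
) -> List[str]:
    """Greedy decoding over precomputed per-step logits (a stand-in for a model)."""
    output: List[str] = []
    since_start = 0
    for logits in steps:
        adjusted = penalize_thought_switch(
            logits, switch_tokens, since_start, penalty, window
        )
        token = max(adjusted, key=lambda t: adjusted[t])
        output.append(token)
        # Emitting a switch token starts a new thought, so the counter resets.
        since_start = 0 if token in switch_tokens else since_start + 1
    return output


if __name__ == "__main__":
    steps = [
        {"compute": 2.0, "alternatively": 1.0},
        {"so": 1.5, "alternatively": 2.0},
        {"answer": 3.0, "alternatively": 0.5},
    ]
    print(greedy_decode(steps, {"alternatively"}))               # penalized
    print(greedy_decode(steps, {"alternatively"}, penalty=0.0))  # unpenalized
```

Nothing about the model changes here: the continuation that stays on the current path was already scored, and the penalty only stops a slightly higher-scoring switch token from winning too early. That is why the intervention needs no retraining.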
The deepest reframing is that o1-style training may not be teaching search *or* storing solutions; it may be *selecting* a capability that's already latent. Five independent methods (RL steering, critique tuning, decoding tweaks, SAE feature steering, RLVR) all elicit reasoning that base models already contain, suggesting post-training selects rather than creates (see 'Do base models already contain hidden reasoning ability?'). Modular 'cognitive tools' lift GPT-4.1 on competition math with no RL at all (see 'Can modular cognitive tools unlock reasoning without training?'). And RL's real job may be to redirect a thinking habit the model misuses, turning counterproductive self-doubt into productive gap-analysis (see 'Does extended thinking help or hurt model reasoning?'). So the cleanest answer to the question is a third option: o1-style reasoning is neither pure learned search nor memorized solutions, but the *elicitation and organization* of a latent search capacity, one that fails precisely when it slips back into memorized local patterns.
Sources (11 notes)
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.