Does the pretrained prior actually constrain what internalized search can discover?
This explores whether a model's pretrained knowledge sets a hard ceiling on what search-based reasoning (internalized MCTS, tree search, self-improvement loops) can actually find — or whether search can discover genuinely new strategies the prior never contained.
This question reads the pretrained prior as a possible boundary: when a model learns to run search inside itself, is it inventing new reasoning, or just re-finding things already latent in its weights? The corpus is genuinely split on this, and the split is the interesting part.
The strongest "yes, the prior constrains" evidence comes from work showing post-training mostly *selects* rather than *creates*. Five independent methods (RL steering, critique tuning, decoding changes, feature steering, and RLVR) all turn out to elicit reasoning that was already present in base-model activations, suggesting the bottleneck is elicitation, not capability (Do base models already contain hidden reasoning ability?). The prior also asserts itself in subtler ways: keyword priming after a gradient update is predictable from the *pre-learning* keyword probability, with a sharp threshold around 10^-3 separating contexts where priming occurs from those where it doesn't (Can we predict keyword priming before learning happens?), and models routinely fail to integrate fresh context when prior associations are strong enough to override it (Why do language models ignore information in their context?). On this view, search runs on rails the prior already laid down.
But other notes push back hard. Meta-CoT instruction-tunes models on linearized search traces (MCTS, A*) and argues this lets them optimize over *algorithms* rather than outputs, potentially unlocking strategies that weren't there before (Can models learn to internalize search algorithms through training?). More striking, a bilevel autoresearch loop read its own inner-loop code, found bottlenecks, and wrote new optimization mechanisms at runtime (combinatorial optimization and bandit methods) that *broke the inner loop's deterministic patterns* and delivered a 5x improvement on GPT pretraining (Can an AI system improve its own search methods automatically?). That looks like discovery escaping the prior, not obeying it.
The likely reconciliation is that the prior constrains the *raw materials* but not their recombination. Procedural knowledge — broad, transferable how-to patterns scattered across pretraining documents — drives reasoning generalization, unlike factual recall which is locked to specific memorized documents (Does procedural knowledge drive reasoning more than factual retrieval?). If reasoning is procedural rather than retrieved, search has room to compose known procedures into combinations the prior never explicitly held. Tree search makes this concrete: MCTS can manufacture its own quality signals and rank solution paths without human labels, generating training signal that didn't exist in the prior (Can tree search replace human feedback in LLM training?), and training on backward reasoning improves forward reasoning by forcing a structural understanding the forward-only prior lacked (Can backward reasoning during training improve forward reasoning?).
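To make the tree-search point concrete, here is a minimal sketch of how search outcomes alone can rank alternative solution paths and yield preference pairs with no human labels. It is an illustration under stated assumptions, not AlphaLLM's implementation: the `Node` class, its `value` field (assumed to be the mean success rate of rollouts passing through that step), the `margin` parameter, and `preference_pairs` are all hypothetical names, and the toy tree is made up.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """One reasoning step in a search tree (hypothetical structure)."""
    step: str
    value: float = 0.0  # assumed: mean success rate of rollouts through this step
    children: list["Node"] = field(default_factory=list)


def preference_pairs(root: Node, margin: float = 0.1) -> list[tuple[str, str]]:
    """Return (preferred, rejected) step pairs among sibling nodes.

    Siblings share the same prefix, so comparing their values ranks
    alternative continuations using search outcomes only.
    """
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        for a, b in combinations(node.children, 2):
            # Skip near-ties: they carry little ranking signal.
            if abs(a.value - b.value) >= margin:
                better, worse = (a, b) if a.value > b.value else (b, a)
                pairs.append((better.step, worse.step))
        stack.extend(node.children)
    return pairs


# Toy tree with made-up values, for illustration only.
tree = Node("problem", children=[
    Node("factor the expression", value=0.8, children=[
        Node("cancel common terms", value=0.9),
        Node("expand again", value=0.2),
    ]),
    Node("guess and check", value=0.3),
])

pairs = preference_pairs(tree)
```

The pairs produced this way are a training signal that did not exist in the prior as such, even though every individual step was generated by the model itself, which is the sense in which search recombines rather than invents raw materials.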
The quiet warning across all of this: the real constraint may be the *training method*, not the prior itself. RL collapses exploration diversity — search agents converge on narrow reward-maximizing strategies through the same entropy collapse seen in reasoning, while SFT on diverse demonstrations preserves breadth (Does reinforcement learning squeeze exploration diversity in search agents?). And direct fine-tuning corrupts knowledge in lower layers, whereas decoding-time proxy tuning leaves the prior intact (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). So the honest answer is: the prior sets the vocabulary, but it's how you train the search — not the prior alone — that decides whether internalized search explores that space or quietly shrinks it.
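One way to check whether a training method is shrinking the explored space is to track the entropy of the strategies a search agent actually uses. The sketch below is an assumption-laden illustration, not a method from the cited notes: the strategy labels and both episode logs are invented, and `strategy_entropy` is a hypothetical helper that computes plain Shannon entropy over the empirical distribution.

```python
import math
from collections import Counter


def strategy_entropy(strategies: list[str]) -> float:
    """Shannon entropy (in bits) of the empirical strategy distribution."""
    counts = Counter(strategies)
    total = len(strategies)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical logs: which search strategy each of 100 episodes used.
sft_agent = ["decompose", "backtrack", "verify", "reformulate"] * 25
rl_agent = ["decompose"] * 97 + ["backtrack"] * 3

sft_bits = strategy_entropy(sft_agent)  # four equally used strategies: 2 bits
rl_bits = strategy_entropy(rl_agent)    # nearly one strategy: close to 0 bits
```

A falling value over the course of training would be one signal of the collapse described above; it says nothing about whether the surviving strategy is good, only that breadth is being lost.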
Sources (10 notes)
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Meta-CoT demonstrates that instruction-tuning on linearized MCTS and A* traces teaches models to implement search strategies internally. This enables optimization over algorithms themselves rather than specific outputs, potentially unlocking novel reasoning strategies.
An outer loop successfully read inner loop code, identified bottlenecks, and generated new Python mechanisms at runtime, discovering combinatorial optimization and bandit methods that broke the inner loop's deterministic patterns and improved performance on GPT pretraining by 5x.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
Training models simultaneously on forward reasoning, backward question generation, and backward reasoning improves forward-only performance by an average of 13.53% across 12 datasets. The mechanism: generating backward questions forces models to understand the inverse relationship between problem and solution, deepening understanding that transfers to forward reasoning without test-time overhead.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
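The "distributional shift" in the proxy-tuning note can be sketched with logit arithmetic. This assumes the commonly described formulation, in which the base model's next-token logits are offset by the difference between a small tuned model and its untuned counterpart; the note itself does not spell out the formula, and the function names and toy logits below are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def proxy_tuned_probs(base, tuned_small, untuned_small):
    """Shift base-model logits by what a small model learned from tuning.

    Assumed formulation: base + (tuned_small - untuned_small).
    The base model's weights are never modified.
    """
    shifted = [b + (t - u) for b, t, u in zip(base, tuned_small, untuned_small)]
    return softmax(shifted)


# Toy next-token logits over a 3-token vocabulary (made-up numbers).
base = [2.0, 1.0, 0.0]
tuned_small = [0.5, 1.5, 0.0]
untuned_small = [0.5, 0.5, 0.0]

probs = proxy_tuned_probs(base, tuned_small, untuned_small)
```

Because the adjustment happens only at decoding time, whatever the base model stores in its weights stays as it was, which is consistent with the note's claim that knowledge in lower layers is left uncorrupted.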