Can models learn to internalize search algorithms through training?
Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could shift what training optimizes: the reasoning algorithm itself rather than individual outputs.
Standard chain-of-thought produces a reasoning trace. Meta-CoT asks a different question: what search process generates that trace? The framework draws on dual-process theory: CoT is System 1 (pattern-completed reasoning), while Meta-CoT is System 2 (deliberate search over reasoning strategies). The claim is that state-of-the-art models like o1 and DeepSeek-R1 already exhibit behaviors consistent with in-context search: they explore multiple paths, backtrack, and select among candidate reasoning chains rather than committing to a single linear trace.
The training pipeline makes the internalization concrete: (1) generate linearized search traces by running MCTS or A* on reasoning problems, (2) instruction-tune on these traces so the model learns the structure of search, and (3) apply RL post-training to refine the search behavior. The linearized traces are the key innovation: they convert tree-structured search into a flat token sequence that autoregressive models can learn through next-token prediction.
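Step (1) can be illustrated with a minimal sketch. It assumes a toy arithmetic problem and uses plain depth-first search in place of MCTS or A*, purely to keep the example short. The trace markers (`STATE`, `TRY`, `BACKTRACK`, `GOAL`) and the function name `linearized_dfs` are hypothetical and not taken from the Meta-CoT paper.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy "reasoning problem": reach `target` from `start`
# using the named operations below, within a depth limit.
OPS: List[Tuple[str, Callable[[int], int]]] = [
    ("+3", lambda x: x + 3),
    ("*2", lambda x: x * 2),
]


def linearized_dfs(
    start: int, target: int, max_depth: int
) -> Tuple[Optional[List[str]], List[str]]:
    """Depth-first search that logs every visit and dead end as a flat trace.

    Returns (solution_path_or_None, trace). The trace is the linearized
    form of the search tree: one marker per event, in visiting order.
    """
    trace: List[str] = []

    def visit(state: int, path: List[str]) -> Optional[List[str]]:
        trace.append(f"STATE {state}")
        if state == target:
            trace.append("GOAL")
            return path
        # Prune when out of depth, or when the state overshoots the target
        # (valid here only because both ops increase positive integers).
        if len(path) == max_depth or state > target:
            trace.append("BACKTRACK")
            return None
        for name, fn in OPS:
            trace.append(f"TRY {name}")
            found = visit(fn(state), path + [name])
            if found is not None:
                return found
        # Every child failed, so this node is a dead end as well.
        trace.append("BACKTRACK")
        return None

    return visit(start, []), trace


if __name__ == "__main__":
    solution, trace = linearized_dfs(start=1, target=8, max_depth=3)
    # Joining the markers gives one sequence a model could be tuned on.
    print(" ".join(trace))
    print("solution:", solution)
```

The failed branches stay in the sequence rather than being discarded, which is what distinguishes a search trace from a cleaned-up chain of thought: the model sees the exploration and the backtracking, not only the winning path.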
The speculative but important claim: if a model can learn to implement search algorithms in-context, then RL training on such a model constitutes optimization over algorithms rather than specific outputs. This could yield novel modes of problem-solving that neither symbolic tree-search nor standard CoT can achieve, because the model is not constrained by the specific search algorithm it was trained on — it can adapt and combine strategies.
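One way to make "optimization over algorithms" concrete is to write the RL objective with the search trace as an explicit variable. The notation here is an assumption introduced for illustration, not taken from the source: $q$ is a problem drawn from a dataset $\mathcal{D}$, $z_{1:T}$ is the linearized search trace, $a$ is the final answer, and $r$ is an outcome reward that scores only the answer.

```latex
\begin{align*}
J(\theta) &= \mathbb{E}_{q \sim \mathcal{D}}\;
  \mathbb{E}_{(z_{1:T},\, a) \sim \pi_\theta(\cdot \mid q)}
  \big[ r(q, a) \big] \\
\nabla_\theta J(\theta) &= \mathbb{E}\Big[ r(q, a) \Big(
  \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(z_t \mid q, z_{<t})
  + \nabla_\theta \log \pi_\theta(a \mid q, z_{1:T}) \Big) \Big]
\end{align*}
```

Under this sketch the reward depends only on $a$, yet the policy gradient also passes through every search decision $z_t$. In that sense RL adjusts the procedure (what to expand, when to backtrack, when to stop) and not just the answer distribution.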
This extends "Does RL teach reasoning or just when to use it?" in a significant direction: Meta-CoT proposes that search can be trained as the "how" of reasoning. The timing thesis says RL teaches when to reason; Meta-CoT says the reasoning process itself can be internalized through exposure to search traces. If both are correct, RL training operates at two levels: activating reasoning (timing) and shaping the reasoning process (search internalization).
However, there is a notable tension with "Does the choice of RL algorithm actually matter for reasoning?": if the pretrained prior bounds exploration, then internalized search may still be constrained by what the model already knows. Meta-CoT would need to demonstrate that linearized search traces genuinely expand the exploration boundary rather than just reorganizing existing capability.
Inquiring lines that use this note as a source 7
This note is a source for these synthesized inquiries.
- Is chain-of-thought reasoning actual computation or distribution imitation?
- How does chain-of-thought training change higher layer computations?
- What distinguishes intrinsic search from extrinsic search method approaches?
- Can models adapt and combine search strategies beyond their training algorithm?
- Does the pretrained prior actually constrain what internalized search can discover?
- How do timing and search internalization interact during reasoning post-training?
- How does continuous soft thinking explore multiple paths without explicit training?
Related concepts in this collection 5
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  extends: Meta-CoT proposes that search CAN be trained as the "how" component
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Meta-CoT goes further: linearized traces may teach a new capability, not just unlock an existing one
- Can reinforcement learning discover reasoning strategies base models cannot?
  Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
  supports: algorithm optimization could be the mechanism for genuine novelty
- Does the choice of RL algorithm actually matter for reasoning?
  Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper, like the pretrained model itself, sets the real limits.
  tension: Meta-CoT claims search is trainable but prior-boundedness may constrain what internalized search can discover
- Can models learn reasoning from predicting any text?
  Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
  complementary internalization approaches: Quiet-STaR internalizes rationale generation at every token during pretraining, while Meta-CoT internalizes search algorithms via linearized traces during post-training. Both aim to embed reasoning into the forward pass, but at different granularities (token-level prediction vs. trace-level search strategy)
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
- Stream of Search (SoS): Learning to Search in Language
- Are Emergent Abilities in Large Language Models just In-Context Learning?
- Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity
- From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
- Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Chain-of-thought Reasoning Is A Policy Improvement Operator
Original note title
meta-cot frames chain-of-thought production as a search problem that models can learn to internalize