SYNTHESIS NOTE

Can models learn to internalize search algorithms through training?

Can chain-of-thought reasoning be taught as an explicit search process that models learn to implement internally? This matters because it could unlock algorithmic optimization rather than just output optimization.

Synthesis note · 2026-02-23 · sourced from Inference time scaling

Standard chain-of-thought produces a reasoning trace. Meta-CoT asks a different question: what search process generates that trace? The framework draws from dual-process theory — CoT is System 1 (pattern-completed reasoning), while Meta-CoT is System 2 (deliberate search over reasoning strategies). The claim is that state-of-the-art models like o1 and DeepSeek-R1 already exhibit behaviors consistent with in-context search: they explore multiple paths, backtrack, and select among candidate reasoning chains rather than generating a single trace sequentially.

The training pipeline makes the internalization concrete: (1) generate linearized search traces by running MCTS or A* on reasoning problems, (2) instruction-tune on these traces so the model learns the structure of search, and (3) apply RL post-training to refine the search behavior. The linearized traces are the key innovation: they convert tree-structured search into sequential token predictions that autoregressive models can learn.
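As a rough illustration of what "linearizing" a search could look like, the sketch below flattens a tree search into one token sequence that records visits, attempted moves, and backtracks. It is not the Meta-CoT pipeline: it substitutes a plain depth-limited depth-first search for MCTS or A* to stay short, and the toy arithmetic task, the tag vocabulary (`<visit>`, `<try>`, `<backtrack>`, `<solved>`, `<answer>`), and the function names are all hypothetical.

```python
from typing import List, Optional

# Hypothetical toy task: turn `start` into `goal` using only these moves.
MOVES = {"+3": lambda x: x + 3, "*2": lambda x: x * 2}


def dfs_trace(state: int, goal: int, depth: int, trace: List[str]) -> Optional[List[str]]:
    """Depth-limited DFS that appends every search event to `trace`.

    Returns the moves that reach `goal`, or None if this branch fails.
    """
    trace.append(f"<visit {state}>")
    if state == goal:
        trace.append("<solved>")
        return []
    # Dead end: out of depth budget, or overshot (both moves only grow positive values).
    if depth == 0 or state > goal:
        trace.append("<backtrack>")
        return None
    for name, move in MOVES.items():
        trace.append(f"<try {name}>")
        rest = dfs_trace(move(state), goal, depth - 1, trace)
        if rest is not None:
            return [name] + rest
    # Every child failed, so this node fails too.
    trace.append("<backtrack>")
    return None


def linearize(start: int, goal: int, max_depth: int = 4) -> str:
    """Flatten the whole search tree walk into one space-separated string."""
    trace: List[str] = []
    solution = dfs_trace(start, goal, max_depth, trace)
    answer = " ".join(solution) if solution is not None else "none"
    return " ".join(trace) + f" <answer {answer}>"


if __name__ == "__main__":
    print(linearize(1, 8))
```

The resulting string keeps the failed branches and the backtracks in sequence, not just the final answer. That is the property that would let next-token training expose a model to the search process itself, under the assumptions stated above.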

The speculative but important claim: if a model can learn to implement search algorithms in-context, then RL training on such a model constitutes optimization over algorithms rather than specific outputs. This could yield novel modes of problem-solving that neither symbolic tree-search nor standard CoT can achieve, because the model is not constrained by the specific search algorithm it was trained on — it can adapt and combine strategies.

This extends "Does RL teach reasoning or just when to use it?" in a significant direction: Meta-CoT proposes that the "how" of reasoning, the search itself, is trainable. The timing thesis says RL teaches when to reason; Meta-CoT says the reasoning process itself can be internalized through exposure to search traces. If both are correct, RL training operates at two levels: activating reasoning (timing) and shaping the reasoning process (search internalization).

However, the tension with "Does the choice of RL algorithm actually matter for reasoning?" is notable: if the pretrained prior bounds exploration, then internalized search may still be constrained by what the model already knows. Meta-CoT would need to demonstrate that linearized search traces genuinely expand the exploration boundary rather than just reorganizing existing capability.

Inquiring lines that read this note 8

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How does latent reasoning compare to verbalized chain-of-thought?

Is chain-of-thought reasoning actual computation or distribution imitation?

What actually drives chain-of-thought reasoning improvements in language models?

How does chain-of-thought training change higher layer computations?

How does objective evolution guide discovery better than fixed planning?

What distinguishes intrinsic search from extrinsic search method approaches?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Can models adapt and combine search strategies beyond their training algorithm?

How should iterative research systems allocate reasoning per search step?

Does the pretrained prior actually constrain what internalized search can discover?

How do training data properties shape reasoning capability development?

How do timing and search internalization interact during reasoning post-training?

How do soft continuous representations explore multiple reasoning paths simultaneously?

How does continuous soft thinking explore multiple paths without explicit training?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Why does explicit chain-of-thought work as a workaround for feedforward transformers?

Related concepts in this collection 5

Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
extends: Meta-CoT proposes that search CAN be trained as the "how" component
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
Meta-CoT goes further: linearized traces may teach a new capability, not just unlock an existing one
Can reinforcement learning discover reasoning strategies base models cannot? Does RL training truly expand what models can do, or does it just find solutions already hidden in base models? ProRL tests this by running RL longer and on diverse tasks beyond mathematics.
supports: algorithm optimization could be the mechanism for genuine novelty
Does the choice of RL algorithm actually matter for reasoning? Expert Iteration, PPO, and RC-RL show similar performance on reasoning tasks. The question is whether algorithm choice drives results or whether something deeper—like the pretrained model itself—sets the real limits.
tension: Meta-CoT claims search is trainable but prior-boundedness may constrain what internalized search can discover
Can models learn reasoning from predicting any text? Does training rationale generation at every token position on arbitrary internet text enable general reasoning without task-specific supervision? This challenges the assumption that reasoning requires curated QA datasets.
complementary internalization approaches: Quiet-STaR internalizes rationale generation at every token during pretraining, while Meta-CoT internalizes search algorithms via linearized traces during post-training — both aim to embed reasoning into the forward pass but at different granularities (token-level prediction vs. trace-level search strategy)

Original note title

meta-cot frames chain-of-thought production as a search problem that models can learn to internalize
