SYNTHESIS NOTE

Does training on messy search processes improve reasoning?

Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.

Synthesis note · 2026-02-22 · sourced from Question Answer Search

Language models are almost never shown fruitful mistakes during training. They see only the outcome of a decision-making process, not the process itself. Stream of Search (SoS) demonstrates what happens when you change this: train LMs on the full search process — exploration, dead ends, backtracking, pruning — represented as a serialized string.

On Countdown (a game in which a set of numbers must be combined with arithmetic operations to reach a target), SoS-pretrained models achieve 25% higher accuracy than models trained to predict only the optimal trajectory. The improvement comes from learning to search rather than learning to predict.
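To make "the full search process as a serialized string" concrete, here is a minimal sketch of a depth-first Countdown solver that logs every step it tries, including dead ends and backtracking. The trace vocabulary (`try`, `dead end`, `backtrack`, `goal`), the rule that every number must be used, and the name `dfs_trace` are assumptions made for illustration; they are not the format or solver used in the SoS work.

```python
from itertools import combinations

def dfs_trace(numbers, target, log):
    """Depth-first Countdown search that appends every step to `log`.

    Assumed rule for this sketch: all numbers must be combined into one
    value, and that value must equal `target`.
    """
    if len(numbers) == 1:
        if numbers[0] == target:
            log.append(f"goal: {target}")
            return True
        log.append(f"dead end: {numbers[0]} != {target}")
        return False
    for i, j in combinations(range(len(numbers)), 2):
        a, b = max(numbers[i], numbers[j]), min(numbers[i], numbers[j])
        rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
        candidates = [("+", a + b), ("*", a * b), ("-", a - b)]
        if b != 0 and a % b == 0:  # only allow exact integer division
            candidates.append(("/", a // b))
        for op, value in candidates:
            new_state = rest + [value]
            log.append(f"try: {a} {op} {b} = {value} -> {new_state}")
            if dfs_trace(new_state, target, log):
                return True
            log.append(f"backtrack from {new_state}")
    return False

log = []
solved = dfs_trace([2, 3, 4], 14, log)
trace = "\n".join(log)  # the whole exploration as one training string
print(trace)
```

An optimal-trajectory dataset would keep only the two steps that lead to the goal; the string above also contains every failed branch and the recovery from it, which is the kind of signal the note argues is missing from ordinary training data.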

SoS systematizes search components into a unified language that captures multiple symbolic search strategies (BFS, DFS, and their variations) in a common serialized format. This is "intrinsic" search, in which the model learns an internal policy for exploration, unlike "extrinsic" approaches such as Tree of Thoughts (ToT) and Graph of Thoughts (GoT), which use fixed external search strategies and call the LM only for generation and evaluation. The distinction matters: extrinsic methods have high inference costs and fixed strategies, while intrinsic search is learned and adaptive.
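The idea of one serialized format covering several strategies can be sketched as follows. In this toy version, BFS and DFS differ only in which end of the frontier is popped, while the emitted tokens (`explore`, `generate`, `dead_end`, `goal`) stay the same. The token names, the `serialize_search` function, and the small `TREE` are hypothetical and chosen only for illustration.

```python
from collections import deque

def serialize_search(start, target, expand, strategy="dfs", max_steps=50):
    """Run BFS or DFS and return the whole exploration as one string."""
    frontier = deque([start])
    seen = {start}
    lines = []
    for _ in range(max_steps):
        if not frontier:
            break
        # The only strategy-specific line: FIFO for BFS, LIFO for DFS.
        state = frontier.popleft() if strategy == "bfs" else frontier.pop()
        lines.append(f"explore {state}")
        if state == target:
            lines.append(f"goal {state}")
            return "\n".join(lines)
        children = [c for c in expand(state) if c not in seen]
        if not children:
            lines.append(f"dead_end {state}")
        for child in children:
            seen.add(child)
            lines.append(f"generate {state} -> {child}")
            frontier.append(child)
    lines.append("no_solution_found")
    return "\n".join(lines)

# Hypothetical toy search space.
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"], "D": [], "E": [], "F": []}

bfs_trace = serialize_search("A", "F", TREE.__getitem__, strategy="bfs")
dfs_trace_text = serialize_search("A", "F", TREE.__getitem__, strategy="dfs")
print(bfs_trace)
print(dfs_trace_text)
```

Because both traces share one vocabulary, a single model could in principle be trained on a mixture of them, which is one reading of what a "unified language" for search buys.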

The most striking finding: SoS models learn internal world models for search. Unlike symbolic search that relies on an explicit environment model, SoS models simulate state transitions themselves. This means the model can generalize its search strategy to novel problems without an explicitly programmed transition function.
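If the model simulates state transitions itself, those transitions can be wrong. One way to probe this, sketched below, is to re-check every arithmetic step in a generated trace against the true operation. The step format `a op b = c` and the function name `transition_errors` are assumptions for illustration, not a procedure described in this note.

```python
import re

# Matches steps such as "7 * 2 = 14" anywhere in a trace (assumed format).
STEP = re.compile(r"(-?\d+)\s*([+\-*/])\s*(-?\d+)\s*=\s*(-?\d+)")

def transition_errors(trace):
    """Return the arithmetic steps in `trace` whose stated result is wrong."""
    errors = []
    for match in STEP.finditer(trace):
        a, op, b, claimed = match.groups()
        a, b, claimed = int(a), int(b), int(claimed)
        if op == "+":
            actual = a + b
        elif op == "-":
            actual = a - b
        elif op == "*":
            actual = a * b
        else:
            actual = a / b if b != 0 else None  # division by zero is never valid
        if actual != claimed:
            errors.append(match.group(0))
    return errors

sample = "try: 3 + 2 = 5\ntry: 5 * 4 = 21\ntry: 8 / 2 = 4"
print(transition_errors(sample))
```

A symbolic searcher never needs this check because its environment model computes each transition; a learned searcher has to get the arithmetic right on its own, so the error list is a rough measure of how faithful its internal simulation is.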

This is distinct from the finding in "Do reasoning traces need to be semantically correct?". That result shows trace CONTENT is dispensable: semantically irrelevant tokens still provide computational scaffolding. SoS shows something different: the search PROCESS itself is valuable training data. The point is not that mistakes are irrelevant (as the corrupted-trace result might suggest) but that the experience of making and recovering from mistakes teaches something that pure success does not.

The self-improvement connection is direct: after SoS pretraining, models can improve via STaR (Self-Taught Reasoner) and APA (Advantage-Induced Policy Alignment), optimizing for correctness on top of the learned search capability. This addresses the snowballing error problem (each wrong step makes subsequent steps more likely to be wrong) by teaching models to BACKTRACK rather than compound errors.
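The STaR-style step can be sketched as a filter-then-finetune loop: sample search traces from the current model, keep only those that end in a correct answer, and fine-tune on the kept set. The sketch below covers only the data-collection half. `sample_trace` and `is_correct` are hypothetical callables standing in for the model and the answer checker, and the canned outputs are made up for the demonstration.

```python
def star_round(problems, sample_trace, is_correct, samples_per_problem=4):
    """Collect one round of self-generated training data.

    A trace is kept if its final answer is correct, even when it
    contains dead ends and backtracking along the way.
    """
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = sample_trace(problem)
            if is_correct(problem, trace):
                kept.append((problem, trace))
                break  # one correct trace per problem suffices for this sketch
    return kept

# Stand-in "model": replays canned traces in order (illustrative only).
canned = iter([
    "explore 3\ndead_end 3",                        # wrong, discarded
    "explore 4\ndead_end 4\nbacktrack\ngoal 10",    # correct despite a dead end
    "explore 1\ngoal 7",                            # correct
])
kept = star_round(
    problems=[10, 7],
    sample_trace=lambda problem: next(canned),
    is_correct=lambda problem, trace: trace.endswith(f"goal {problem}"),
)
print(kept)
```

The filter scores only the outcome, so traces that recover from a wrong branch survive into the next round of training, which fits the note's claim that backtracking is what keeps errors from compounding.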

SoS also offers a potential training-side remedy for the problem raised in "Why do reasoning LLMs fail at deeper problem solving?": if wandering exploration is the problem, training on systematic search processes (including recovery from wrong paths) could teach the systematic search strategy that current reasoning models lack.

SoS is fundamentally a training FORMAT intervention. Following "Does training data format shape reasoning strategy more than domain?", representing the search process as serialized strings, with explicit backtracking markers, dead-end annotations, and pruning decisions, is a format choice that shapes the resulting reasoning strategy. Training on BFS-like versus DFS-like exploration mirrors the multiple-choice (MC) versus free-form (FF) format distinction: the serialization format determines whether the model learns breadth-first or depth-first search behavior.

Following "How quickly do errors compound during model self-training?", SoS's inclusion of backtracking in training data also directly addresses the avalanching vulnerability: a model that has learned to recognize dead ends and backtrack from them is structurally less susceptible to compounding errors in self-training loops.

Inquiring lines that read this note 27

This note is a source for the following research framings.

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

How much does organized knowledge improve learning efficiency versus raw data?

How does example difficulty affect learning efficiency in language models?

Does self-reflection enable models to reliably correct their errors?

How can AI systems learn from failures without cascading errors?

Do corrupted reasoning traces serve as effective supervision signals?

Why do human-curated thought examples fail to improve model thinking?

How can LLM user simulators model realistic goal-driven conversation?

When does simulated search outperform real search for agent training?

Do language models develop causal world models or rely on statistical patterns?

Can external summarization solve exploration problems in complex real-world environments?

How can models identify insufficient information and respond appropriately without guessing?

How does proactive critical thinking enable models to identify missing information?

How do training data properties shape reasoning capability development?

How does reasoning graph topology affect breakthrough insights and generalization?

What distinguishes systematic search from wandering exploration in reasoning?

Does RLHF training sacrifice accuracy and grounding for user agreement?

How does RLHF training push chatbots toward problem-solving over exploration?

How does objective evolution guide discovery better than fixed planning?

What distinguishes intrinsic search from extrinsic search method approaches?

When do additional thinking tokens stop improving reasoning performance?

Does more thinking always improve language model accuracy?

What capability tradeoffs emerge when scaling model reasoning abilities?

Can training models on backward reasoning improve their forward planning ability?

How do training priors constrain what context information can override?

Why does evaluating errors teach more than imitating correct responses?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Can models adapt and combine search strategies beyond their training algorithm?

How do self-generated feedback mechanisms enable effective model learning?

What emergent behaviors do models develop when trained on underspecified pedagogical tasks?

Does reinforcement learning teach reasoning or just when to reason?

Why does extended reasoning training improve exploration without adding new capabilities?

How should iterative research systems allocate reasoning per search step?

How does o1-style reasoning relate to learned search processes versus memorized solutions?

Can alternative training methods improve on supervised fine-tuning for language models?

Can we reverse the instruction-following deficit through targeted training?

Related concepts in this collection 6

Do reasoning traces need to be semantically correct?
Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
Complementary finding: corrupted traces show content is dispensable; SoS shows PROCESS exposure is beneficial. Different mechanisms.

Why do reasoning LLMs fail at deeper problem solving?
Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
SoS could address wandering by training systematic search with backtracking.

Do reasoning models switch between ideas too frequently?
Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
SoS teaches when to persist vs when to backtrack, addressing the premature switching problem from the training side.

Can reasoning topologies be formally classified as graph types?
This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
SoS represents multiple search topologies in a unified serialized format.

Does training data format shape reasoning strategy more than domain?
What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
SoS is a format intervention: serializing search processes (BFS, DFS, backtracking) as training strings shapes the resulting reasoning strategy.

How quickly do errors compound during model self-training?
When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
SoS's backtracking training directly counters avalanching: models learn to recognize and recover from dead ends rather than compounding errors.

Original note title

training on the search process including mistakes and backtracking produces better problem-solvers than training on optimal trajectories only
