Does training on messy search processes improve reasoning?
Can language models learn better problem-solving by observing full exploration trajectories—including mistakes and backtracking—rather than only optimal solutions? This matters because current LMs rarely see the decision-making process itself.
Language models are almost never shown fruitful mistakes during training. They see only the outcome of a decision-making process, not the process itself. Stream of Search (SoS) demonstrates what happens when you change this: train LMs on the full search process — exploration, dead ends, backtracking, pruning — represented as a serialized string.
On Countdown (a game in which a set of numbers must be combined with arithmetic operations to reach a target), SoS-pretrained models achieve 25% higher accuracy than models trained to predict only the optimal trajectory. The improvement comes from learning to search rather than learning to predict.
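To make "the full search process represented as a serialized string" concrete, here is a minimal sketch of a depth-first Countdown solver that logs every attempt, dead end, and backtrack. The function names (`dfs_trace`, `stream_of_search`) and the trace vocabulary ("try", "dead end, backtrack", "goal reached") are illustrative assumptions, not the format used in the SoS paper.

```python
from itertools import combinations

def dfs_trace(nums, target, log):
    """Depth-first Countdown search that records every step, including dead ends."""
    log.append(f"state {sorted(nums)} target {target}")
    if len(nums) == 1:
        if nums[0] == target:
            log.append("goal reached")
            return True
        log.append("dead end, backtrack")
        return False
    # Pick any two numbers, combine them, and recurse on the smaller problem.
    for i, j in combinations(range(len(nums)), 2):
        a, b = max(nums[i], nums[j]), min(nums[i], nums[j])
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        candidates = [("+", a + b), ("*", a * b), ("-", a - b)]
        if b != 0 and a % b == 0:  # only exact integer division is allowed
            candidates.append(("/", a // b))
        for op, value in candidates:
            log.append(f"try {a} {op} {b} = {value}")
            if dfs_trace(rest + [value], target, log):
                return True
    log.append("no moves left, backtrack")
    return False

def stream_of_search(nums, target):
    """Return (solved, trace), where trace is the whole search as one string."""
    log = []
    solved = dfs_trace(list(nums), target, log)
    return solved, "\n".join(log)

if __name__ == "__main__":
    solved, trace = stream_of_search([2, 3], 6)
    print(trace)
```

An optimal-trajectory dataset would keep only the successful step; the trace above also keeps the failed attempt and the explicit backtrack, which is the kind of signal the note argues is missing from ordinary training data.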
SoS systematizes search components into a unified language that captures multiple symbolic search strategies (BFS, DFS, and their variations) in a common serialized format. This is "intrinsic" search — the model learns an internal policy for exploration — unlike "extrinsic" approaches (ToT, GoT) that use fixed external search strategies and call the LM only for generation and evaluation. The distinction matters: extrinsic methods have high inference costs and fixed strategies, while intrinsic search is learned and adaptive.
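The idea of a common serialized format for several strategies can be sketched as follows: one loop, one trace vocabulary, and only the frontier discipline (queue vs stack) changes between BFS and DFS. This is a hedged illustration; `successors`, `serialize_search`, the `budget` parameter, and the line labels are hypothetical names chosen for this example, not the paper's specification.

```python
from collections import deque
from itertools import combinations

def successors(nums):
    """Yield (description, new_numbers) for each legal Countdown move."""
    for i, j in combinations(range(len(nums)), 2):
        a, b = max(nums[i], nums[j]), min(nums[i], nums[j])
        rest = tuple(n for k, n in enumerate(nums) if k not in (i, j))
        moves = [("+", a + b), ("*", a * b), ("-", a - b)]
        if b != 0 and a % b == 0:  # only exact integer division is allowed
            moves.append(("/", a // b))
        for op, value in moves:
            yield f"{a}{op}{b}={value}", rest + (value,)

def serialize_search(nums, target, strategy="dfs", budget=200):
    """Run BFS or DFS and emit one trace string in a shared vocabulary."""
    frontier = deque([tuple(nums)])
    lines = []
    while frontier and len(lines) < budget:
        # The only strategy-specific line: queue order (BFS) or stack order (DFS).
        state = frontier.popleft() if strategy == "bfs" else frontier.pop()
        lines.append(f"expand {list(state)}")
        if state == (target,):
            lines.append("goal")
            return "\n".join(lines)
        if len(state) == 1:
            lines.append("dead end")
            continue
        for move, child in successors(state):
            lines.append(f"  generate {move} -> {list(child)}")
            frontier.append(child)
    lines.append("give up")
    return "\n".join(lines)

if __name__ == "__main__":
    print(serialize_search([1, 2, 3], 7, strategy="dfs"))
```

Because both strategies share the same line types, a model trained on a mixture of such traces sees one language for search, while the ordering of "expand" lines carries the difference between breadth-first and depth-first behavior.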
The most striking finding: SoS models learn internal world models for search. Unlike symbolic search that relies on an explicit environment model, SoS models simulate state transitions themselves. This means the model can generalize its search strategy to novel problems without an explicitly programmed transition function.
This is distinct from the finding in "Do reasoning traces need to be semantically correct?". That result shows trace CONTENT is dispensable: semantically irrelevant tokens still provide computational scaffolding. SoS shows something different: the search PROCESS itself is valuable training data. The claim is not that mistakes are irrelevant, as the corrupted-trace result might suggest; it is that the experience of making and recovering from mistakes teaches something that pure success does not.
The self-improvement connection is direct: after SoS pretraining, models can improve further via STaR (Self-Taught Reasoner) and APA (Advantage-Induced Policy Alignment), which optimize for correctness on top of the learned search capability. This addresses the snowballing error problem (each wrong step makes subsequent steps more likely to be wrong) by teaching models to BACKTRACK rather than compound errors.
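The "optimize for correctness on top of learned search" step can be sketched as a STaR-style filtering pass: sample search traces from the current model, keep only those that verifiably reach the target, and fine-tune on the kept set. This is a simplified illustration under stated assumptions: `sample_fn` stands in for a model call, the `a op b = v` step syntax is hypothetical, and the checker verifies only arithmetic and the final value, not whether each number was actually available in the puzzle.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.floordiv}
STEP = re.compile(r"(\d+) ([-+*/]) (\d+) = (\d+)")

def is_correct(trace, target):
    """True if every written step is arithmetically valid and the last one hits the target."""
    steps = STEP.findall(trace)
    if not steps:
        return False
    for a, op, b, v in steps:
        a, b, v = int(a), int(b), int(v)
        if op == "/" and (b == 0 or a % b != 0):
            return False  # reject division by zero and inexact division
        if OPS[op](a, b) != v:
            return False  # the model "simulated" this transition incorrectly
    return int(steps[-1][3]) == target

def star_round(problems, sample_fn, samples_per_problem=4):
    """One filtering pass: sample traces and keep only the verified-correct ones."""
    kept = []
    for nums, target in problems:
        for _ in range(samples_per_problem):
            trace = sample_fn(nums, target)  # hypothetical model call
            if is_correct(trace, target):
                kept.append({"prompt": f"{nums} -> {target}", "completion": trace})
    return kept  # the caller would fine-tune on `kept`, then repeat
```

Note that a kept trace may still contain dead ends and backtracking; the filter rewards reaching the goal, not searching without mistakes.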
For the problem raised in "Why do reasoning LLMs fail at deeper problem solving?", SoS provides a potential training solution: if wandering exploration is the problem, training on systematic search processes (including recovery from wrong paths) could teach the systematic search strategy that current reasoning models lack.
SoS is fundamentally a training FORMAT intervention. In the terms of "Does training data format shape reasoning strategy more than domain?", representing the search process as serialized strings (with explicit backtracking markers, dead-end annotations, and pruning decisions) is a format choice that shapes the resulting reasoning strategy. SoS training on BFS-like vs DFS-like exploration mirrors the multiple-choice vs free-form (MC/FF) format distinction: the serialization format determines whether the model learns breadth-first or depth-first search behavior. And with respect to "How quickly do errors compound during model self-training?", SoS's inclusion of backtracking in training data directly addresses the avalanching vulnerability: a model that has learned to recognize dead ends and backtrack from them is structurally less susceptible to compounding errors in self-training loops.
Inquiring lines that use this note as a source 26
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How much does organized knowledge improve learning efficiency versus raw data?
- Does partial trace guidance work better than curriculum learning for hard problems?
- Can AI-generated explanations of errors teach as effectively as self-resolution?
- How should learning environments balance error prevention with pedagogical value?
- Why do human-curated thought examples fail to improve model thinking?
- When does simulated search outperform real search for agent training?
- Can external summarization solve exploration problems in complex real-world environments?
- How does proactive critical thinking enable models to identify missing information?
- Why does exploration quality matter more than learner network depth?
- Does reflection training actually teach models to self-correct their mistakes?
- Why do difficult problems force models to develop reasoning strategies?
- Why does critique training produce deeper understanding than imitation training?
- What distinguishes systematic search from wandering exploration in reasoning?
- How does RLHF training push chatbots toward problem-solving over exploration?
- What distinguishes intrinsic search from extrinsic search method approaches?
- What happens when students encounter errors they cannot resolve through prompting alone?
- Does more thinking always improve language model accuracy?
- Can training models on backward reasoning improve their forward planning ability?
- Why does evaluating errors teach more than imitating correct responses?
- Can models adapt and combine search strategies beyond their training algorithm?
- How do failure examples improve distillation compared to successful trajectories alone?
- Why do adaptive curriculum schemes outperform static difficulty filters?
- What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
- Why do students learn better from explanations than from solving problems from scratch?
- Why does extended reasoning training improve exploration without adding new capabilities?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
Related concepts in this collection 6
- Do reasoning traces need to be semantically correct?
  Can models learn to solve problems from deliberately corrupted or irrelevant reasoning traces? This challenges assumptions about what makes intermediate tokens useful for learning.
  complementary finding: corrupted traces show content is dispensable; SoS shows PROCESS exposure is beneficial. Different mechanisms.
- Why do reasoning LLMs fail at deeper problem solving?
  Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.
  SoS could address wandering by training systematic search with backtracking
- Do reasoning models switch between ideas too frequently?
  Research explores whether o1-like models abandon promising reasoning paths prematurely by switching to different approaches without sufficient depth, and whether penalizing such transitions could improve accuracy.
  SoS teaches when to persist vs when to backtrack, addressing the premature switching problem from the training side
- Can reasoning topologies be formally classified as graph types?
  This explores whether Chain of Thought, Tree of Thought, and Graph of Thought represent distinct formal graph structures with different computational properties. Understanding this matters because the topology itself determines what reasoning strategies are possible.
  SoS represents multiple search topologies in a unified serialized format
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  SoS is a format intervention: serializing search processes (BFS, DFS, backtracking) as training strings shapes the resulting reasoning strategy
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  SoS's backtracking training directly counters avalanching: models learn to recognize and recover from dead ends rather than compounding errors
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Stream of Search (SoS): Learning to Search in Language
- RLVMR: Reinforcement Learning with Verifiable Meta-Reasoning Rewards for Robust Long-Horizon Agents
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
- Teaching Large Language Models to Reason with Reinforcement Learning
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
- Reasoning LLMs are Wandering Solution Explorers
- Faith and Fate: Limits of Transformers on Compositionality
Original note title
training on the search process including mistakes and backtracking produces better problem-solvers than training on optimal trajectories only