Why do reasoning models abandon promising solution paths?
Explores whether reasoning models fail because they think insufficiently or because they structurally misorganize their thinking. Challenges the assumption that longer reasoning traces automatically improve performance.
The dominant narrative about reasoning models: they think step by step, explore the solution space, and arrive at answers through deliberation. The reality: they wander.
The formalization. Systematic exploration requires three properties: validity (following only legal transitions), effectiveness (reaching the goal), and necessity (visiting no wasted states). Current reasoning LLMs fail on all three. A model performing DFS on a binary tree of depth d with branch-omission probability p_w sees its success probability drop exponentially in d: problems that look tractable at depth 5 become effectively impossible at depth 15.
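The exponential drop can be illustrated with a small sketch. It rests on a simplifying assumption that is not taken from the note: the goal is a single leaf, and each of the d branches on the root-to-goal path is omitted independently with the branch-omission probability (named `p_w` in the code). Under that assumption the success probability is (1 - p_w)^d. The function names and the value 0.2 are illustrative only.

```python
import random


def success_probability(depth, p_w):
    """Closed form under the independence assumption: the goal is reached
    only if none of the `depth` branches on the root-to-goal path is omitted."""
    return (1.0 - p_w) ** depth


def simulate(depth, p_w, trials=20_000, seed=0):
    """Monte Carlo estimate of the same quantity.

    Only the branches on the path to the goal leaf matter: an otherwise
    exhaustive DFS finds the goal exactly when every one of them is kept.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # A branch is kept with probability 1 - p_w.
        if all(rng.random() >= p_w for _ in range(depth)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # p_w = 0.2 is an assumed value chosen for illustration.
    for d in (5, 10, 15):
        print(d, round(success_probability(d, 0.2), 4), round(simulate(d, 0.2), 4))
```

With the assumed p_w = 0.2, the closed form gives roughly 0.33 at depth 5 and roughly 0.035 at depth 15, which is the kind of collapse the paragraph above describes.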
The complementary failure. Separately, o1-like models exhibit "underthinking" — not too little total reasoning, but too little depth per reasoning thread. The model starts down a promising path, encounters difficulty, switches to another approach, encounters difficulty there, switches again. The result is a long trace (many tokens) with shallow exploration (insufficient depth on any single path).
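One rough way to make "long trace, shallow threads" measurable is to split a trace at phrases that signal a switch of approach and compare the total length with the length of each thread. The sketch below is only an illustration: the marker phrases, the whitespace-based token count, and the function names are all assumptions, and a real analysis would need a validated marker list and the model's own tokenizer.

```python
import re

# Hypothetical switch markers, chosen for illustration only.
SWITCH_MARKERS = ("alternatively", "another approach", "let me try a different")


def split_into_thoughts(trace, markers=SWITCH_MARKERS):
    """Split a trace into threads, starting a new thread at each marker."""
    pattern = "|".join(re.escape(m) for m in markers)
    # A zero-width lookahead keeps each marker at the start of its thread.
    parts = re.split(f"(?={pattern})", trace, flags=re.IGNORECASE)
    return [p.strip() for p in parts if p.strip()]


def depth_stats(trace):
    """Summarize total length versus per-thread length.

    Whitespace-separated words stand in for tokens (an approximation).
    """
    thoughts = split_into_thoughts(trace)
    lengths = [len(t.split()) for t in thoughts]
    if not lengths:
        return {"num_thoughts": 0, "total_tokens": 0,
                "mean_tokens_per_thought": 0.0, "max_tokens_per_thought": 0}
    return {
        "num_thoughts": len(lengths),
        "total_tokens": sum(lengths),
        "mean_tokens_per_thought": sum(lengths) / len(lengths),
        "max_tokens_per_thought": max(lengths),
    }
```

A trace with a large `total_tokens` but a small `max_tokens_per_thought` would match the pattern described above: many tokens spent, little depth on any single path.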
Why both matter together. Wandering and underthinking are not the same failure mode, but they reinforce each other. A model that switches approaches prematurely (underthinking) generates more abandoned branches to wander between (wandering). More compute doesn't fix either — a wandering model given more tokens wanders more extensively, and an underthinking model given more tokens switches more frequently.
The practical fix is surprising. TIP (Thought-switching Penalty) is a pure decoding strategy that penalizes tokens signaling thought transitions. It improves accuracy without fine-tuning — just by encouraging the model to stay on its current path longer. The implication: the model often had a viable path and abandoned it prematurely. The answer was reachable from the original approach.
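A decoding-time penalty of this kind can be sketched in a few lines. The note says only that TIP penalizes tokens signaling thought transitions, so the details here are assumptions for illustration: a fixed amount `alpha` is subtracted from the logits of designated switch tokens, and the penalty is active only for the first `beta` tokens of the current thought. The token ids, default values, and function names are hypothetical.

```python
def apply_switch_penalty(logits, switch_token_ids, tokens_in_current_thought,
                         alpha=3.0, beta=200):
    """Return a copy of `logits` with thought-switch tokens penalized.

    alpha: amount subtracted from each switch-token logit (assumed value).
    beta:  number of tokens after a thought starts during which the
           penalty is active (assumed value).
    """
    penalized = list(logits)  # leave the caller's logits untouched
    if tokens_in_current_thought < beta:
        for tok in switch_token_ids:
            penalized[tok] -= alpha
    return penalized


def greedy_pick(logits):
    """Index of the largest logit (greedy decoding over a toy vocabulary)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy vocabulary (hypothetical): 0 = "so", 1 = "alternatively", 2 = "therefore".
toy_logits = [1.0, 1.5, 0.5]
early = apply_switch_penalty(toy_logits, [1], tokens_in_current_thought=10)
late = apply_switch_penalty(toy_logits, [1], tokens_in_current_thought=500)
```

In this toy case `greedy_pick(early)` selects token 0, so the model continues its current thread, while `greedy_pick(late)` selects token 1, so switching is allowed again once the thread has been given room to develop. No weights change; only the logits seen by the decoder do.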
This reframes the entire "scale inference compute" research program. The bottleneck is not how much the model thinks — it is how it structures its thinking. A tourist visiting more landmarks is not the same as a scientist following a hypothesis to its conclusion.
Supporting material:
Inquiring lines that use this note as a source (230)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Why do foundation models develop heuristics instead of world models?
- Can surface heuristics override implicit constraints in domain-specific reasoning?
- How do unstated feasibility constraints affect model decision-making?
- When does knowledge activation fail across different model architectures?
- Why does step-by-step reasoning fail when tool outputs get very large?
- Can corrupted reasoning traces be reliably distinguished from correct ones?
- What distinguishes planning knowledge from an executable plan that works?
- Can explicit constraint statements override the dominance of surface heuristics?
- Why do single examples trigger large reasoning improvements in models?
- Can penalizing reasoning transitions fix underthinking without fine-tuning models?
- Can reflection in reasoning models be corrective rather than just confirmatory?
- Can step-level deliberation flags guide other reasoning systems?
- What design principles prevent error cascades in multi-step evaluation systems?
- Can graph cyclicity and topology predict when reasoning systems achieve breakthrough insights?
- Are correct reasoning traces measurably shorter than incorrect ones?
- Why must procedural skills consolidate before strategic reasoning can develop?
- Can a proposer agent actively surface a solver's weaknesses to prevent plateau?
- How do agents revise their own errors during autonomous architecture discovery?
- Can a single SAE feature control reasoning behavior across model families?
- How does critique fine-tuning on one problem unlock broader reasoning?
- What architectural features enable counterfactual reasoning in world models?
- Why do contrastive reasoning approaches outperform single-path belief evaluation?
- Can the structure-routing principle apply beyond RAG to other AI reasoning systems?
- What advantages emerge from running 13 times more parallel reasoning chains with the same budget?
- Do tool-enabled reasoning models close the gap on constraint satisfaction?
- Why do reasoning models fail on structurally unfamiliar instances?
- How do self-revisions degrade reasoning accuracy in extended traces?
- Why do correct reasoning traces appear shorter than incorrect ones?
- Why does chain-of-thought fail when problems lack matching training schemata?
- Can reasoning traces prove models are actually reasoning versus mimicking?
- How do planning and backtracking sentences control reasoning traces?
- Can diverse critiques on a single problem unlock reasoning without diverse problem sets?
- Can external verifiers replace reasoning trace quality in solution guarantees?
- How does scene-switching prevent cross-problem interference in multi-agent reasoning?
- Why does most refinement in iterative models maintain answers rather than improve them?
- Do reasoning models show the same answer-maintenance pattern that diffusion models exhibit?
- Can routing systems prevent expert models from failing outside their specialty?
- Why does fine-tuning degrade reasoning quality even as accuracy improves?
- Why does fine-tuning sometimes damage chain-of-thought reasoning even when accuracy improves?
- Can evolutionary approaches avoid the overthinking failure mode of iterative refinement?
- Why do models automatically adjust reasoning length to problem difficulty?
- What triggers overthinking versus underthinking in reasoning models?
- Can adaptive compute distribution across prompts replace the need for sophisticated reasoning frameworks?
- Can subtask-level voting replace sequential revision for improving long-horizon task accuracy?
- Why do shorter correct reasoning traces contain fewer failed branches?
- How do failed branches remain in context and contaminate subsequent reasoning?
- Can removing failed branches from edited traces improve previous mistakes?
- Does parallel sampling avoid failed-branch contamination more than sequential thinking?
- Why do models fail on logically equivalent tasks with different data distributions?
- Does more inference compute help reasoning models match specialized domain performance?
- When does explicit reasoning actually degrade performance on a task?
- Why are correct reasoning traces consistently shorter than incorrect ones?
- Does architectural design matter more than model scale for reasoning tasks?
- Do models trained for safety over-refuse compared to models trained for reasoning?
- Can reasoning traces serve purposes beyond producing the final answer itself?
- How do chain-of-thought structures affect reasoning robustness?
- Why do temporal reasoning patterns matter more than final answers?
- How does reasoning instability prevent models from modeling individuals?
- Why do simple math problems get worse with longer reasoning chains?
- Why does step-by-step reasoning degrade performance on judgment-based tasks?
- What saliency patterns distinguish successful from failed chain-of-thought reasoning?
- Why does self-revision degrade reasoning accuracy in o1-like models?
- Can parallel independent reasoning outperform sequential iterative refinement?
- How do graph topology properties like cyclicity and diameter affect reasoning quality?
- Why do longer reasoning chains signal hesitation rather than depth?
- Does reasoning structure match explicit versus implicit task demands?
- How does evaluation format change what we measure about model reasoning?
- What makes a novel research idea practically infeasible for implementation?
- How do foundation models develop task-specific heuristics instead of world models?
- Do reasoning models perform genuine logical evaluation or pattern matching?
- Why does extended reasoning fail for search and knowledge retrieval tasks?
- Why does iterative refinement amplify rather than correct reasoning errors?
- Does reasoning trace style explain why RL post-training improves model reasoning?
- Why do reasoning models struggle with self-evaluation and revision?
- Do shorter reasoning traces actually produce more reliable model outputs?
- What makes multi-hypothesis generation better than single-path social reasoning?
- When does self-reflection actually help reasoning models improve?
- Does reflection destabilize reasoning in dynamic environments?
- Do reasoning models overthink ill-posed questions instead of recognizing incompleteness?
- How does proactive critical thinking enable models to identify missing information?
- Why do correct reasoning traces tend to be shorter than incorrect ones?
- Why do reasoning models verbalize reasoning shortcuts less than necessary?
- Can explicit rejection responses solve the over-specialization failure mode?
- Why do some reasoning models fail to detect redundancy in concurrent coordination?
- What makes diverse reasoning sources more valuable than deeper single paths?
- Can prompt engineering improve reasoning or only move requests into denser regions?
- Why does reflection in reasoning models stay confirmatory instead of corrective?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- How does RL refine reasoning paths without simply adding model capability?
- How does the functional separation of knowledge and reasoning affect adaptation methods?
- Why do reasoning models fail when input length increases even below context limits?
- Why do reasoning chains degenerate into undirected exploration at scale?
- How does separating decomposition from execution improve multi-step reasoning?
- Do reasoning systems reuse cognitive structures across unrelated topics?
- Why do reasoning models confidently generate wrong answers instead of abstaining?
- Do base models and reasoning models fail in opposite directions on uncertainty?
- Why do reasoning models reduce effort despite having token budget remaining?
- Can explicit optimal algorithms prevent reasoning model collapse at high complexity?
- Can recursive subtask trees implement tree-of-thought reasoning more efficiently?
- How does graph of thoughts enable divide-and-conquer reasoning patterns?
- What makes multi-paradigm chaining a distinct reasoning topology?
- Why do reasoning models wander instead of searching systematically?
- How do longer reasoning chains create vulnerability to attacks?
- Is the reasoning cliff actually a tool-use problem?
- Why do larger reasoning models show cyclicity only in later layers?
- Can deliberate corruption of reasoning traces harm out of distribution generalization?
- Why do reasoning models produce unfaithful or unhelpful reasoning traces?
- Why do verbalized reasoning chains fail on certain problem classes?
- Why does overthinking degrade performance at extreme recursion depths?
- Why do reasoning traces resemble mimicry rather than verified problem-solving?
- Why does revision often make reasoning accuracy worse in frontier models?
- Why do linear research pipelines lose global context across planning and generation steps?
- Why does outcome supervision fail for long reasoning chains?
- Why does reflection in reasoning models tend to be confirmatory rather than corrective?
- Why do difficult problems force models to develop reasoning strategies?
- What distinguishes coherent reasoning from inaccurate but plausible predictions?
- Why does reasoning graph topology evolve differently across training phases?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- What distinguishes systematic search from wandering exploration in reasoning?
- Why does extending reasoning traces worsen persona consistency?
- Why do longer reasoning chains correlate with lower accuracy in o1-like models?
- What changes when reasoning models adopt trajectory-response output formats?
- Why do models overthink underspecified problems instead of rejecting them?
- How does soft thinking compare to sampling multiple independent reasoning paths?
- Does this reasoning steering method work consistently across all model sizes?
- How does extended thinking affect variance in reasoning model outputs?
- Why does reflection in reasoning models confirm rather than correct initial directions?
- Do correct reasoning traces tend to be shorter than incorrect ones?
- Why do reasoning-optimized models show no sycophancy resistance advantage?
- Why do some reasoning steps receive negligible attention from later steps?
- Can static reasoning patterns work better than dynamic branch selection?
- Can test-time voting improve reasoning beyond the base model's original capabilities?
- How does collaboration itself become a degradation mechanism in reasoning tasks?
- Why do reasoning models fail at learning hidden rules from sparse exceptions?
- When is detailed step-by-step reasoning actually counterproductive for solving a problem?
- How can correct explanations coexist with failed applications in AI?
- Why does reasoning fine-tuning reduce a model's ability to abstain?
- Why do models skip steps that would make reasoning clearer?
- Can reasoning models succeed at logic but fail at execution?
- Do reasoning failures stem from strategy or from calculation breakdown?
- Does unrestricted reasoning per search step degrade iterative quality over time?
- Do search agents face their own overthinking threshold like reasoning models do?
- Does penalizing thought transitions improve reasoning without model retraining?
- Why does more inference compute amplify wandering rather than solving it?
- Do reasoning models switch approaches when encountering local difficulty?
- How can high benchmark performance mask broken reasoning in AI systems?
- Why do foundation models develop task-specific heuristics instead of causal understanding?
- How does backtracking capability address error compounding in chain-of-thought reasoning?
- Why does failed step fraction predict reasoning quality better than trace length?
- How can prompt intervention reduce redundant reasoning steps dynamically?
- Why do correct reasoning traces stay shorter than incorrect ones?
- Why are incorrect reasoning traces longer than correct ones?
- Can minimal reasoning steps match verbose reasoning accuracy?
- What mechanisms cause reasoning models to wander rather than focus?
- What debugging behaviors signal that a user has abandoned the coding loop?
- Why do some students restart entire projects instead of debugging incrementally?
- How do single wrong steps corrupt entire reasoning chains?
- Can multi-agent debate prevent reasoning models from amplifying errors?
- Why do reasoning model failures stem from execution rather than reasoning?
- What separates good workflow design from poor workflow design?
- How should benchmarks evaluate workflow architecture versus raw model performance?
- What happens to model reasoning accuracy as thinking token requirements exceed critical thresholds?
- Does algorithmic decomposition prevent planning-execution interference in reasoning?
- Can operationalizing theory into prompt structure improve reasoning more than theory itself?
- Why does extended chain-of-thought reasoning fail to improve numerical optimization performance?
- Why do reasoning models fail to improve constrained optimization performance?
- Why do frontier models corrupt more documents than weaker models during workflows?
- What planning strategies reduce execution steps without sacrificing solution quality?
- Can verification loops and decomposition fix judgment failures?
- Why does RL behavior differ between standard reasoning tasks and complex planning domains?
- What makes planning, tool use, and reasoning into jointly optimizable subsystems?
- How does making implicit reasoning requirements explicit change model performance?
- What failure modes emerge when scheme classification feeds downstream reasoning pipelines?
- How can process reward models handle branching and revisiting in reasoning traces?
- Can training models on backward reasoning improve their forward planning ability?
- What role do local backtracking steps play in reasoning traces?
- What happens to iterative search quality when reasoning depth is unconstrained?
- Why does a replay mechanism prevent reasoner skills from over-specializing?
- How do progressive abstraction chains differ from branching reasoning topologies?
- Why do wrong numbers cost less accuracy than shuffled reasoning steps?
- How does planning-before-execution compare to iterative reasoning and action loops?
- What causes reasoning quality to degrade during long research tasks?
- Why do standard process reward models struggle with branching reasoning traces?
- How much of a reasoning trace is actually redundant or unnecessary?
- Can benchmark improvements hide degradation of deliberative reasoning?
- Why does per-step deliberation lose global perspective compared to dynamic discovery?
- What distinguishes graph-of-thought reasoning from other structured reasoning topologies?
- How can reasoning quality be verified before integrating new information into a reasoning graph?
- What limits external scaling when a model lacks reasoning foundation?
- Why might chain-of-thought reasoning bypass action selection pathways?
- What makes answer equivalence sufficient to discard a reasoning path?
- Do linearized traces genuinely expand exploration beyond standard chain-of-thought?
- What distinguishes genuine capability gains from coherent but invalid reasoning traces?
- Why do reasoning traces persuade users without improving their accuracy?
- Does preference tuning help or hurt the exploration of solution spaces in code?
- Does performative reasoning mask underlying uncertainty even on easy problems?
- Why does reflection in reasoning models mostly confirm the first answer?
- Why do longer reasoning chains explore like tourists instead of scientists?
- What makes deterministic recursive reasoning models underperform on multi-solution tasks?
- Do reasoning benchmarks predict real performance in long delegated workflows?
- What makes a thinking trace take information shortcuts?
- How can benchmark accuracy scores mask the absence of interpretable reasoning structure?
- Can partial solution traces convert unproductive hard samples into learnable training data?
- Why do shorter confident reasoning traces fail on out-of-distribution problems?
- Why do thinking models execute longer tasks than standard language models?
- Can completeness scaffolding substitute for actual code execution in reasoning?
- How does active reasoning through interaction differ from passive single-turn problem solving?
- What makes reasoning traces effective or ineffective for solving problems?
- Can backward planning reduce search difficulty when multiple goal state paths exist?
- Why does single-shot learning fail in REVTHINK's multi-source reasoning tasks?
- When does RL discover genuinely novel reasoning strategies versus timing optimization?
- How does confidence filtering improve selection of reasoning traces?
- Is reasoning failure caused by task complexity or training distribution gaps?
- How should AI ideation systems decompose and recombine research concepts?
- Can structured workflows unlock latent reasoning abilities that raw models don't show?
- Why are shorter reasoning traces more reliable than longer correct ones?
- Can we detect redundant reasoning steps during model inference instead of training?
- Why do reasoning traces fail to accurately reflect model decision-making?
- Why do non-experts default to familiar chart types despite domain complexity?
- How do search and reasoning workflows improve forecasting performance over base models?
- Why does reasoning backward enable better forward reasoning performance?
- Can indirect and direct reasoning methods be combined to improve results?
- Why does reflection in reasoning models often become theater rather than genuine thought?
- How can we turn reasoning model failures into useful training signals?
- How does o1-style reasoning relate to learned search processes versus memorized solutions?
- What role do cyclic fixed points play in stable reasoning?
- What makes multi-turn critique trajectories more effective than single-turn reasoning chains?
- Why does strategy diversity within reasoning chains improve model generalization?
- How does early commitment in reasoning differ from early exploitation in planning?
- Can tools unlock reasoning strategies that require abstract insight beyond computation?
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning LLMs are Wandering Solution Explorers
- Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
- Large Language Model Reasoning Failures
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Large Language Models Think Too Fast To Explore Effectively
- DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
- A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap
- Reasoning Can Hurt the Inductive Abilities of Large Language Models
Original note title
the wandering mind — why reasoning models explore like tourists not scientists