SYNTHESIS NOTE
Training, RL, and Test-Time Scaling · Reasoning, Retrieval, and Evaluation · Model Architecture and Internals

Why do reasoning LLMs fail at deeper problem solving?

Explores whether current reasoning models systematically search solution spaces or merely wander through them, and how this affects their ability to solve increasingly complex problems.

Synthesis note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Reasoning LLMs are Wandering Solution Explorers" provides the most rigorous formalization yet of why reasoning models fail as problem complexity increases. The claim: current RLLMs do not systematically explore solution spaces. They wander.

Systematic exploration requires three properties of a reasoning trace: (a) validity: the trace follows the reachability structure of the problem; (b) effectiveness: the trace contains at least one goal state; (c) necessity: every state in the trace contributes to goal discovery or dead-end elimination. Current models fail all three.
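The three properties can be made concrete with a small trace checker. This is a minimal sketch and not the paper's formal definition: the function name `check_trace`, the graph encoding, and the example states are hypothetical. Validity is simplified to "each new state is a successor of some already-visited state" (which allows backtracking), and necessity is approximated by the much weaker proxy "no state is visited twice".

```python
from typing import Dict, Hashable, List, Set

# Hypothetical encoding: each state maps to the set of states reachable in one step.
Graph = Dict[Hashable, Set[Hashable]]


def check_trace(trace: List[Hashable], graph: Graph,
                start: Hashable, goals: Set[Hashable]) -> Dict[str, bool]:
    """Check a trace against simplified versions of the three properties."""
    visited: Set[Hashable] = set()
    valid = bool(trace) and trace[0] == start
    revisits = 0
    for i, state in enumerate(trace):
        if state in visited:
            # Assumption: a revisit is treated as an unnecessary step.
            revisits += 1
        elif i > 0 and not any(state in graph.get(p, set()) for p in visited):
            # A new state must be one step away from something already explored.
            valid = False
        visited.add(state)
    return {
        "valid": valid,
        "effective": any(s in goals for s in trace),
        "no_revisits": revisits == 0,  # proxy for necessity, not the full property
    }


if __name__ == "__main__":
    graph = {"A": {"B", "C"}, "B": {"D"}, "C": {"G"}}
    print(check_trace(["A", "B", "D", "C", "G"], graph, "A", {"G"}))  # systematic
    print(check_trace(["A", "B", "A", "D"], graph, "A", {"G"}))       # wandering
```

A trace that jumps to a state with no explored predecessor fails validity, a trace that never reaches a goal fails effectiveness, and a trace that loops back over explored states fails the necessity proxy.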

The formalization makes the failure quantifiable. Consider a wandering RLLM performing depth-first search on a binary tree of depth d, where at each decision point it omits one of the two child nodes with probability p_w. Because an omission can occur at every level on the way to the goal, the success probability drops exponentially with d. This is not a gradual degradation; it is catastrophic. Problems that appear within reach at depth 5 become virtually impossible at depth 15, not because the model lacks reasoning ability but because it lacks search discipline.
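The exponential decay can be illustrated numerically. The sketch below is a toy model and may differ from the paper's exact formula. It assumes a single goal leaf, that an omitted subtree is never revisited, and that when a child is omitted it is the on-path child half the time. Under those assumptions the success probability is (1 - p_w/2)^d. The function names and the value p_w = 0.2 are illustrative choices, not taken from the source.

```python
import random


def closed_form_success(p_w: float, depth: int) -> float:
    """Success probability under the stated toy assumptions: (1 - p_w/2)^depth."""
    return (1.0 - p_w / 2.0) ** depth


def simulate_success(p_w: float, depth: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        reached = True
        for _ in range(depth):
            # With probability p_w one child is omitted; assume it is the
            # on-path child half the time, which loses the goal for good.
            if rng.random() < p_w and rng.random() < 0.5:
                reached = False
                break
        hits += reached
    return hits / trials


if __name__ == "__main__":
    p_w = 0.2  # illustrative omission probability, not a measured value
    for d in (5, 10, 15, 30):
        print(d, round(closed_form_success(p_w, d), 4), round(simulate_success(p_w, d), 4))
```

The key design point is that the per-level survival factor is strictly below 1 for any p_w > 0, so adding depth multiplies the loss instead of adding to it.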

Four failure modes are identified:

The finding directly challenges the "more thinking tokens = better reasoning" narrative. A wandering model given more tokens does not explore more systematically; it wanders more extensively. This is the mechanism behind the related note "Does more thinking time always improve reasoning accuracy?": additional compute does not fix a structural search deficiency.

The exponential degradation result connects to the related note "Does policy entropy collapse limit reasoning performance in RL?". Entropy collapse reduces exploration diversity during training; wandering reduces exploration discipline during inference. Both are manifestations of the same problem: the model converges on familiar patterns rather than systematically covering the solution space.

Apple's three-regime confirmation. "The Illusion of Thinking" (Apple) provides independent confirmation through controllable puzzle environments with precise complexity manipulation. Three performance regimes emerge: (1) low complexity, where standard models outperform reasoning models and use fewer tokens; (2) medium complexity, where reasoning models gain an advantage through extended thinking; (3) high complexity, where the accuracy of both model types collapses to zero. Near the collapse point, reasoning models reduce their reasoning effort despite having ample token budget, a counterintuitive behavioral scaling limit. Even providing an explicit optimal algorithm does not prevent collapse, confirming that the bottleneck is execution, not conceptualization. The three-regime structure refines the wandering explorer thesis: wandering is harmful at low complexity (overthinking easy problems), partially beneficial at medium complexity (exploring toward solutions), and irrelevant at high complexity (no amount of wandering reaches the goal).

Original note title

reasoning llms are wandering explorers not systematic searchers — four failure modes degrade success probability exponentially with problem depth