Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
This explores whether 'thinking wide' (sampling many parallel paths, including in continuous latent space) beats 'thinking deep' (one chain-of-thought) on logical tasks — and the corpus turns out to disagree with itself in an instructive way. The honest answer is: it depends on whether the problem can be split into independent attempts or genuinely has to be accumulated one step at a time.
On the 'breadth wins' side, several notes converge. Scaling reasoning in width by sampling parallel latent trajectories sidesteps the serial latency of going deeper and samples the solution space without the variance inflation that comes from just extending one chain (see "Can reasoning systems scale wider instead of only deeper?"). More bluntly, multiple independent paths with majority voting reach up to 22% higher accuracy than a single extended chain under the same token budget (see "Why does parallel reasoning outperform single chain thinking?"). And there's a smarter version of breadth than brute sampling: generating diverse *abstractions* first creates a structured breadth-first search that fixes the 'underthinking' failure where a single deep chain commits early and never recovers (see "Can abstractions guide exploration better than depth alone?").
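To make the voting idea concrete, here is a minimal sketch of majority voting over independently sampled answers, plus a toy calculation of how often the vote lands on the right answer. It is an illustration under loud assumptions, not the method from any of the cited notes: the function names are hypothetical, samples are treated as fully independent, and the answer is binary with per-sample accuracy `p`.

```python
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def p_majority_correct(p, k):
    """Probability that a strict majority of k samples is correct.

    Toy assumptions (not from the source notes): samples are independent,
    the answer is binary, each sample is right with probability p, k is odd.
    """
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


# Three sampled paths, two of which agree.
vote = majority_vote(["A", "B", "A"])

# Voting amplifies an edge above chance and amplifies a deficit below it.
above_chance = p_majority_correct(0.6, 3)
below_chance = p_majority_correct(0.4, 3)
```

Under these assumptions, voting only helps when a single path is already right more often than not. That fits the picture of breadth as a statistical edge rather than a way to reason better.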
But the 'depth wins' side has an equally sharp result, and it's the one that should give you pause. On genuinely compositional problems (think graph connectivity, where you have to chain intermediate results together), sequential chain-of-thought achieves an *exponential* accuracy advantage over parallel voting, because short parallel chains simply can't accumulate the multi-step state the answer requires (see "When does sequential reasoning beat parallel voting?"). So breadth doesn't dominate; it dominates a *class* of problems. Where the task is 'find one of many viable solutions,' width wins. Where the task is 'execute a long dependent chain correctly,' depth wins, and no amount of parallel sampling substitutes for the sequential accumulation.
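A toy version of the graph-connectivity case shows why splitting a budget into short chains can fail outright. This is a deliberately simplified sketch, not the experiment from the cited note: the graph, the hop budget, and the `reachable_within` helper are all hypothetical, and a "chain" is modeled as a search that may only follow a fixed number of hops.

```python
from collections import Counter


def reachable_within(adj, source, target, max_hops):
    """Report whether target is reachable from source in at most max_hops hops."""
    seen = {source}
    frontier = {source}
    for _ in range(max_hops):
        # Each hop builds on the nodes accumulated by the previous hops.
        frontier = {v for u in frontier for v in adj.get(u, ()) if v not in seen}
        if not frontier:
            break
        seen |= frontier
    return target in seen


# Path graph 0 -> 1 -> ... -> 8: connectivity needs all 8 hops in sequence.
adj = {i: [i + 1] for i in range(8)}
total_hops = 8

# Depth: one chain spends the whole budget.
deep = reachable_within(adj, 0, 8, total_hops)

# Width: four chains get a quarter of the budget each, then vote.
wide_votes = [reachable_within(adj, 0, 8, total_hops // 4) for _ in range(4)]
wide = Counter(wide_votes).most_common(1)[0][0]
```

Here the deep chain finds the connection, while every short chain stops after two hops and votes 'not connected', so the majority is confidently wrong. Real sampled chains would differ from one another, but in this setup none of them can cover the eight dependent hops.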
The deeper catch, and the thing you didn't know you wanted to know, is that the 'logical tasks' framing in the question may be doing more work than the search method. A line of skeptical results argues chain-of-thought isn't really doing logic at all: it reproduces the *form* of reasoning through learned pattern-matching rather than valid inference, which is why structurally invalid prompts still succeed and why performance collapses predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", "What makes chain-of-thought reasoning actually work?", and "Does chain-of-thought reasoning actually generalize beyond training data?"). If CoT is constrained imitation, then breadth-first search in continuous space isn't competing against 'genuine logic'; it's competing against a different imitation strategy, and its real edge is statistical (sampling diverse guesses) rather than logical (reasoning better).
A couple of doorways if you want to push further: the comparison is also confounded by *length*, since CoT accuracy follows an inverted-U and you can often match a verbose chain at a fraction of the tokens, meaning 'depth' is frequently padded with non-computational text (see "Why does chain of thought accuracy eventually decline with length?" and "Can minimal reasoning chains match full explanations?"). And underneath both methods sits training: reasoning models beat non-reasoning ones regardless of inference budget, so how you trained the model may matter more than whether you search wide or deep at test time (see "Can non-reasoning models catch up with more compute?").
Sources (10 notes)
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.
RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.
On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.