INQUIRING LINE

Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?

This explores whether spreading reasoning out in parallel — especially sampling many trajectories through a model's latent (continuous) space — can beat the single step-by-step chain-of-thought on logic problems, and the corpus says the answer flips depending on the shape of the task.


The question is whether 'thinking wide' (sampling many parallel paths, including in continuous latent space) beats 'thinking deep' (one chain-of-thought) on logical tasks, and the corpus turns out to disagree with itself in an instructive way. The honest answer: it depends on whether the problem can be split into independent attempts or genuinely has to be accumulated one step at a time.

On the 'breadth wins' side, several notes converge. Scaling reasoning in width by sampling parallel latent trajectories sidesteps the serial latency of going deeper and samples the solution space without the variance inflation that comes from just extending one chain (see: Can reasoning systems scale wider instead of only deeper?). More bluntly, multiple independent paths with majority voting reach up to 22% higher accuracy than a single extended chain under the same token budget (see: Why does parallel reasoning outperform single chain thinking?). And there's a smarter version of breadth than brute sampling: generating diverse *abstractions* first creates a structured breadth-first search that fixes the 'underthinking' failure where a single deep chain commits early and never recovers (see: Can abstractions guide exploration better than depth alone?).
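To make the majority-voting intuition concrete, here is a minimal sketch under a deliberately simple assumption that is not taken from the cited notes: each parallel path answers a yes/no question correctly with the same probability `p`, independently of the others. The function name `majority_vote_accuracy` is hypothetical. Under that toy model, voting amplifies accuracy only when a single path is already better than chance.

```python
from math import comb


def majority_vote_accuracy(p: float, k: int) -> float:
    """Probability that a strict majority of k independent binary votes is correct.

    Toy assumption: every path is correct with probability p, independently.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so that ties cannot occur")
    needed = k // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(needed, k + 1))


if __name__ == "__main__":
    for p in (0.4, 0.6):
        for k in (1, 5, 21):
            print(f"p={p} k={k} accuracy={majority_vote_accuracy(p, k):.3f}")
```

With `p` above 0.5 the voted accuracy rises as `k` grows; with `p` below 0.5 it falls. Real reasoning paths are correlated and rarely binary, so this only illustrates the mechanism, not the size of the reported gain.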

But the 'depth wins' side has an equally sharp result, and it's the one that should give you pause. On genuinely compositional problems (think graph connectivity, where you have to chain intermediate results together), sequential chain-of-thought achieves an *exponential* accuracy advantage over parallel voting, because short parallel chains simply can't accumulate the multi-step state the answer requires (see: When does sequential reasoning beat parallel voting?). So breadth doesn't dominate; it dominates a *class* of problems. Where the task is 'find one of many viable solutions,' width wins. Where the task is 'execute a long dependent chain correctly,' depth wins, and no amount of parallel sampling substitutes for the sequential accumulation.
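The graph-connectivity point can be illustrated with a toy sketch. It is not the construction used in the cited note; the helper names (`path_graph`, `reachable_within`, `parallel_vote`) and the idea of equating one reasoning step with one frontier expansion are assumptions made for illustration. A fixed step budget is either spent on one long chain or split evenly across `k` shorter chains that then vote.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]


def path_graph(n: int) -> Graph:
    """Undirected path 0 - 1 - ... - (n - 1)."""
    graph: Graph = {i: [] for i in range(n)}
    for i in range(n - 1):
        graph[i].append(i + 1)
        graph[i + 1].append(i)
    return graph


def reachable_within(graph: Graph, source: int, target: int, max_steps: int) -> bool:
    """Breadth-first expansion that stops after max_steps frontier updates.

    Each update depends on the previous frontier, so the state is built up sequentially.
    """
    seen = {source}
    frontier = {source}
    for _ in range(max_steps):
        if target in seen:
            return True
        frontier = {v for u in frontier for v in graph[u]} - seen
        if not frontier:
            break
        seen |= frontier
    return target in seen


def parallel_vote(graph: Graph, source: int, target: int, total_steps: int, k: int) -> bool:
    """Split total_steps evenly across k chains and take a majority vote.

    The chains here are deterministic and identical, which is a simplification.
    """
    per_chain = total_steps // k
    votes = [reachable_within(graph, source, target, per_chain) for _ in range(k)]
    return sum(votes) > k / 2


if __name__ == "__main__":
    g = path_graph(7)  # node 6 is six hops from node 0
    print("one chain, 6 steps:", parallel_vote(g, 0, 6, total_steps=6, k=1))
    print("three chains, 2 steps each:", parallel_vote(g, 0, 6, total_steps=6, k=3))
```

In this sketch the single chain with six steps finds the connection, while three chains of two steps each all stop short and vote for the wrong answer. Adding more short chains would not help, because none of them ever holds the intermediate state the answer depends on.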

The deeper catch, and the thing you didn't know you wanted to know, is that the 'logical tasks' framing in the question may be doing more work than the search method. A line of skeptical results argues chain-of-thought isn't really doing logic at all: it reproduces the *form* of reasoning through learned pattern-matching rather than valid inference, which is why structurally invalid prompts still succeed and why performance collapses predictably under distribution shift (see: Does chain-of-thought reasoning reveal genuine inference or pattern matching?; What makes chain-of-thought reasoning actually work?; Does chain-of-thought reasoning actually generalize beyond training data?). If CoT is constrained imitation, then breadth-first search in continuous space isn't competing against 'genuine logic'; it's competing against a different imitation strategy, and its real edge is statistical (sampling diverse guesses) rather than logical (reasoning better).

A couple of doorways if you want to push further: the comparison is also confounded by *length*, since CoT accuracy follows an inverted-U and you can often match a verbose chain at a fraction of the tokens, meaning 'depth' is frequently padded with non-computational text (see: Why does chain of thought accuracy eventually decline with length?; Can minimal reasoning chains match full explanations?). And underneath both methods sits training: reasoning models beat non-reasoning ones regardless of inference budget, so how you trained the model may matter more than whether you search wide or deep at test time (see: Can non-reasoning models catch up with more compute?).


Sources: 10 notes

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Why does parallel reasoning outperform single chain thinking?

Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning actually work?

CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning systems analyst. The question: does breadth-first search in continuous space (sampling parallel latent trajectories) outperform chain-of-thought on logical tasks? Treat this as still-open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. The corpus is genuinely contradictory:
• Parallel sampling beats single-chain CoT by up to 22% under token budget parity on decomposable tasks, but sequential CoT achieves *exponential* accuracy advantage on compositional/graph-connectivity problems where intermediate results must accumulate (2025).
• CoT may not perform genuine logic at all — it reproduces reasoning *form* through pattern-matching; performance collapses under distribution shift and structurally invalid prompts succeed (2025).
• Optimal CoT length follows an inverted-U; verbose chains often carry non-computational padding, and concise chains match verbose accuracy at 7.6% of tokens (2025).
• Test-time compute scaling (depth vs. width) matters less than training signal: reasoning models outperform non-reasoning ones regardless of inference budget (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.21825 (2025-05): long chains exponentially superior to short ones
• arXiv:2506.02878 (2025-06): CoT is constrained imitation, not true reasoning
• arXiv:2502.07266 (2025-02): chain length optimization
• arXiv:2505.20296 (2025-05): reasoning models as solution explorers

Your task:
(1) RE-TEST EACH CONSTRAINT. For decomposable vs. compositional task split: has recent work (last 6 months) shown that wider, structured breadth-first approaches (e.g., tree search with learned value functions, multi-agent ensembles with memory) now recover the exponential gap on sequential problems? Or does the gap remain? Separately: if CoT is imitation, does breadth-first latent search escape that trap, or does it face the same distribution-shift fragility?
(2) Surface work from the last ~6 months that contradicts or supersedes the 'breadth vs. depth' framing — e.g., unification via meta-learning, curriculum-based reasoning, or inference-time RL that renders the width/depth distinction moot.
(3) Propose two questions: (a) Can *learned* breadth heuristics (e.g., trained value networks guiding which paths to sample) close the compositional gap without reverting to sequential depth? (b) If test-time training (e.g., RLP, RL fine-tuning at inference) becomes standard, does the width/depth comparison become obsolete?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
