INQUIRING LINE

Can an AI reason better by trying many paths at once instead of following one careful chain of thought?

Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?

This explores whether 'thinking wide' (sampling many parallel paths, including trajectories through a model's continuous latent space) beats 'thinking deep' (one step-by-step chain-of-thought) on logical tasks. The corpus disagrees with itself in an instructive way: the answer flips with the shape of the task. It depends on whether the problem can be split into independent attempts or genuinely has to be accumulated one step at a time.

On the 'breadth wins' side, several notes converge. Scaling reasoning in width by sampling parallel latent trajectories sidesteps the latency of going deeper, and it samples the solution space without the variance inflation that comes from just extending one chain (see "Can reasoning systems scale faster by exploring parallel paths instead?"). More bluntly, multiple independent paths with majority voting reach up to 22% higher accuracy than a single extended chain under the same token budget (see "Why does parallel reasoning outperform single chain thinking?"). And there's a smarter version of breadth than brute sampling: generating diverse *abstractions* first creates a structured breadth-first search that fixes the 'underthinking' failure, where a single deep chain commits early and never recovers (see "Can abstractions guide exploration better than depth alone?").
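The voting mechanism behind the 'breadth wins' result can be sketched in a few lines. This is a toy model, not the method of any cited note: it assumes each sampled path is an independent binary guess that is right with probability `p`, which real sampled chains are not (their errors are correlated). The function names and the values of `p` and `k` are illustrative assumptions.

```python
import random
from collections import Counter
from math import comb


def majority_vote(answers):
    """Return the most common answer among the sampled paths."""
    return Counter(answers).most_common(1)[0][0]


def exact_majority_accuracy(p, k):
    """Probability that a strict majority of k independent votes is correct.

    Assumes k is odd and each vote is correct with probability p.
    """
    return sum(
        comb(k, i) * p**i * (1 - p) ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


def simulated_majority_accuracy(p, k, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, using majority_vote."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        votes = ["right" if rng.random() < p else "wrong" for _ in range(k)]
        hits += majority_vote(votes) == "right"
    return hits / trials


if __name__ == "__main__":
    for k in (1, 3, 11):
        print(k, round(exact_majority_accuracy(0.6, k), 3))
```

Under these assumptions, accuracy climbs with `k` whenever a single path is right more often than wrong, which is the statistical edge of width. The sketch says nothing about the size of the gain on real benchmarks.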

But the 'depth wins' side has an equally sharp result, and it's the one that should give you pause. On genuinely compositional problems (think graph connectivity, where you have to chain intermediate results together), sequential chain-of-thought achieves an *exponential* accuracy advantage over parallel voting, because short parallel chains simply can't accumulate the multi-step state the answer requires (see "When does sequential reasoning beat parallel voting?"). So breadth doesn't dominate; it dominates a *class* of problems. Where the task is 'find one of many viable solutions,' width wins. Where the task is 'execute a long dependent chain correctly,' depth wins, and no amount of parallel sampling substitutes for the sequential accumulation.
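A toy version of the graph-connectivity argument makes the failure of short chains concrete. It is a sketch under loud assumptions, not a reproduction of the cited result: the graph is a simple path, a 'chain' is modeled as a breadth-first search with a hop budget, and a chain that runs out of budget falls back to a coin flip. All names and parameters are hypothetical.

```python
import random
from collections import deque


def make_path_instance(n, connected):
    """Adjacency for a path 0-1-...-n; the last edge is dropped if not connected."""
    adj = {i: [] for i in range(n + 1)}
    last = n if connected else n - 1
    for i in range(last):
        adj[i].append(i + 1)
        adj[i + 1].append(i)
    return adj


def bounded_search(adj, source, target, max_hops):
    """BFS limited to max_hops. Returns True/False if decided, None if the budget ran out."""
    seen = {source}
    frontier = deque([(source, 0)])
    ran_out = False
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True
        if depth == max_hops:
            ran_out = True  # unexplored graph may lie beyond this node
            continue
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return None if ran_out else False


def sequential_answer(adj, n):
    """One long chain that spends the whole budget of n hops."""
    return bounded_search(adj, 0, n, max_hops=n)


def parallel_vote_answer(adj, n, k, hops_per_chain, rng):
    """k short chains; an undecided chain guesses, then a strict majority wins."""
    votes = []
    for _ in range(k):
        result = bounded_search(adj, 0, n, max_hops=hops_per_chain)
        votes.append(rng.random() < 0.5 if result is None else result)
    return sum(votes) * 2 > k


def accuracy(solver, n, trials, rng):
    """Fraction of random connected/disconnected instances the solver gets right."""
    hits = 0
    for _ in range(trials):
        truth = rng.random() < 0.5
        hits += solver(make_path_instance(n, truth)) == truth
    return hits / trials


if __name__ == "__main__":
    rng, n = random.Random(0), 35
    print(accuracy(lambda a: sequential_answer(a, n), n, 500, rng))
    print(accuracy(lambda a: parallel_vote_answer(a, n, 7, 5, rng), n, 500, rng))
```

With the same total budget of 35 hops, seven chains of five hops each never see the far end of the path, so their votes carry no information and voting cannot rescue them. The single long chain decides every instance.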

The deeper catch, and the thing you didn't know you wanted to know, is that the 'logical tasks' framing in the question may be doing more work than the search method. A line of skeptical results argues chain-of-thought isn't really doing logic at all: it reproduces the *form* of reasoning through learned pattern-matching rather than valid inference, which is why structurally invalid prompts still succeed and why performance collapses predictably under distribution shift (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?", "What makes chain-of-thought reasoning fail in language models?", and "Does chain-of-thought reasoning actually generalize beyond training data?"). If CoT is constrained imitation, then breadth-first search in continuous space isn't competing against 'genuine logic'. It's competing against a different imitation strategy, and its real edge is statistical (sampling diverse guesses) rather than logical (reasoning better).

A couple of doorways if you want to push further. The comparison is also confounded by *length*: CoT accuracy follows an inverted-U, and you can often match a verbose chain at a fraction of the tokens, meaning 'depth' is frequently padded with non-computational text (see "Why does chain of thought accuracy eventually decline with length?" and "Can minimal reasoning chains match full explanations?"). And underneath both methods sits training: reasoning models beat non-reasoning ones regardless of inference budget, so how you trained the model may matter more than whether you search wide or deep at test time (see "Can non-reasoning models catch up with more compute?").
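The length confound can be put in budget terms with a little arithmetic. The sketch below uses the 7.6% token figure from the Chain of Draft note in the sources; the 1,000-token verbose chain is an arbitrary assumption chosen only for illustration, and the function names are hypothetical.

```python
CONCISE_TOKEN_RATIO = 0.076  # from the Chain of Draft note: 7.6% of the tokens


def concise_tokens(verbose_tokens, ratio=CONCISE_TOKEN_RATIO):
    """Approximate length of a concise chain, given the verbose chain's length."""
    return round(verbose_tokens * ratio)


def chains_in_budget(budget_tokens, tokens_per_chain):
    """How many whole chains of a given length fit in a token budget."""
    return budget_tokens // tokens_per_chain


if __name__ == "__main__":
    verbose = 1000  # assumed length of one verbose chain, in tokens
    concise = concise_tokens(verbose)
    print(concise, chains_in_budget(verbose, concise))
```

On these numbers, the budget of one verbose chain would cover about 13 concise ones. Whether voting over those extra chains helps still depends on the task shape discussed above.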

Sources 10 notes

Can reasoning systems scale faster by exploring parallel paths instead?

GRAM demonstrates that recursive reasoning models should maintain and explore multiple latent trajectories in parallel, not only deepen single paths. Width-scaling avoids the serial latency penalty of depth while sampling the solution distribution more effectively on ambiguous problems.

Why does parallel reasoning outperform single chain thinking?

Multiple independent reasoning paths with majority voting achieve up to 22% higher accuracy than extending a single chain under the same token budget. Parallel diversity samples reasoning capability more faithfully than sequential extension, which inflates variance without improving correctness.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

When does sequential reasoning beat parallel voting?

On structured tasks requiring sequential multi-step reasoning like graph connectivity, chain-of-thought achieves exponentially higher accuracy than parallel voting. The difference emerges because solutions genuinely require accumulating intermediate results sequentially, which short parallel chains cannot achieve.

Does chain-of-thought reasoning reveal genuine inference or pattern matching?

CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.

What makes chain-of-thought reasoning fail in language models?

Research shows CoT mirrors reasoning form without true logical abstraction. Format matters more than content, invalid prompts work as well as valid ones, and scaling reasoning creates instruction-following deficits.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Can minimal reasoning chains match full explanations?

Chain of Draft achieves equivalent accuracy to standard chain-of-thought on arithmetic, symbolic, and commonsense tasks while using only 7.6% of tokens. The 92.4% of removed tokens served style and documentation, not computation.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

When More is Less: Understanding Chain-of-Thought Length in LLMs (5.28 match, arXiv)
Break the Chain: Large Language Models Can be Shortcut Reasoners (5.22 match, arXiv)
Hierarchical Reasoning Model (3.50 match, arXiv)
RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems (3.39 match, arXiv)
CoT is Not True Reasoning, It Is Just a Tight Constraint to Imitate: A Theory Perspective (2.71 match, arXiv)
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens (2.68 match, arXiv)
Measuring Faithfulness in Chain-of-Thought Reasoning (2.64 match, arXiv)
Chain of Thoughtlessness? An Analysis of CoT in Planning (2.55 match, arXiv)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a reasoning systems analyst. The question: does breadth-first search in continuous space (sampling parallel latent trajectories) outperform chain-of-thought on logical tasks? Treat this as still-open.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2025. The corpus is genuinely contradictory:
• Parallel sampling beats single-chain CoT by up to 22% under token budget parity on decomposable tasks, but sequential CoT achieves *exponential* accuracy advantage on compositional/graph-connectivity problems where intermediate results must accumulate (2025).
• CoT may not perform genuine logic at all — it reproduces reasoning *form* through pattern-matching; performance collapses under distribution shift and structurally invalid prompts succeed (2025).
• Optimal CoT length follows an inverted-U; verbose chains often carry non-computational padding, and concise chains match verbose accuracy at 7.6% of tokens (2025).
• Test-time compute scaling (depth vs. width) matters less than training signal: reasoning models outperform non-reasoning ones regardless of inference budget (2025).

Anchor papers (verify; mind their dates):
• arXiv:2505.21825 (2025-05): long chains exponentially superior to short ones
• arXiv:2506.02878 (2025-06): CoT is constrained imitation, not true reasoning
• arXiv:2502.07266 (2025-02): chain length optimization
• arXiv:2505.20296 (2025-05): reasoning models as solution explorers

Your task:
(1) RE-TEST EACH CONSTRAINT. For decomposable vs. compositional task split: has recent work (last 6 months) shown that wider, structured breadth-first approaches (e.g., tree search with learned value functions, multi-agent ensembles with memory) now recover the exponential gap on sequential problems? Or does the gap remain? Separately: if CoT is imitation, does breadth-first latent search escape that trap, or does it face the same distribution-shift fragility?
(2) Surface work from the last ~6 months that contradicts or supersedes the 'breadth vs. depth' framing — e.g., unification via meta-learning, curriculum-based reasoning, or inference-time RL that renders the width/depth distinction moot.
(3) Propose two questions: (a) Can *learned* breadth heuristics (e.g., trained value networks guiding which paths to sample) close the compositional gap without reverting to sequential depth? (b) If test-time training (e.g., RLP, RL fine-tuning at inference) becomes standard, does the width/depth comparison become obsolete?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
