Do models excel at reasoning depth or memory breadth when scaling test-time compute?
This explores whether spending more compute at inference time actually buys deeper reasoning, or whether it really buys breadth — wider sampling, more coverage, more shots at the answer — and the corpus suggests the line between the two is blurrier than the question assumes.
Read the question as a contest between 'thinking harder' (depth) and 'covering more ground' (breadth) as you scale test-time compute, and the corpus's most striking move is to dissolve the contest. The cleanest version comes from work showing that extended thinking traces improve accuracy mainly by inflating output variance, since broader distributions cover the correct answer more often, rather than by reasoning better (see: Does extended thinking actually improve reasoning or just increase variance?). Push past a critical point and the distribution gets too diffuse, so accuracy drops. In other words, a lot of what looks like depth is breadth wearing a longer chain of thought. That reframes the whole question: the gains you attribute to deeper reasoning may be coverage gains in disguise.
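The rise-then-fall shape described above can be reproduced with a deliberately simple toy model. This is a sketch under stated assumptions, not the analysis from the cited note: answers are drawn from a normal distribution centered away from the correct answer, and a draw counts as a hit if it lands within a small tolerance of it. The names `hit_probability`, `offset`, and `tolerance` are hypothetical.

```python
import math


def normal_cdf(x: float, mean: float, std: float) -> float:
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def hit_probability(std: float, offset: float = 1.0, tolerance: float = 0.1) -> float:
    """Chance that one draw from N(0, std) lands within `tolerance` of the
    correct answer, which sits `offset` away from the model's central guess.
    Toy assumption: output spread is the only thing that changes."""
    upper = normal_cdf(offset + tolerance, 0.0, std)
    lower = normal_cdf(offset - tolerance, 0.0, std)
    return upper - lower


# Widen the output distribution while the central guess stays wrong.
spreads = [0.2, 0.5, 1.0, 2.0, 5.0]
hits = [hit_probability(s) for s in spreads]
for s, h in zip(spreads, hits):
    print(f"std={s:<4} hit probability={h:.4f}")
```

In this toy setup the hit probability climbs as the spread grows toward the size of the offset, then falls once the distribution is too diffuse. Nothing about the central guess improves at any point, which is the sense in which the gain is coverage rather than better reasoning.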
The field formalizes the same tension as parallel-vs-sequential scaling, which several notes treat as the recurring axis of test-time compute (see: How should we balance parallel versus sequential compute at test time?). Parallel methods (sampling many independent attempts) buy coverage; sequential methods (long compositional chains) buy depth. The optimal mix isn't a property of the model; it's a property of the task. Parallel wins for short independent problems, sequential for chains where intermediate steps accumulate. A complementary line argues you can even get depth-like benefits without paying depth's serial latency, by sampling parallel latent trajectories that explore the solution space at once (see: Can reasoning systems scale wider instead of only deeper?). So 'breadth' keeps doing work that we instinctively credit to 'depth.'
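A back-of-the-envelope sketch shows why task structure decides the mix. It assumes independent attempts and independent steps, which real models do not guarantee, and the function names are hypothetical. Coverage from k parallel attempts is 1 - (1 - p)^k, while a single attempt at an n-step chain succeeds with probability p_step^n, so parallel sampling alone needs many more attempts as chains lengthen.

```python
import math


def coverage(p_single: float, k: int) -> float:
    """Probability that at least one of k independent attempts is correct."""
    return 1.0 - (1.0 - p_single) ** k


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that one attempt gets all n dependent steps right,
    assuming each step succeeds independently with probability p_step."""
    return p_step ** n_steps


def samples_needed(p_single: float, target: float) -> int:
    """Smallest k with coverage(p_single, k) >= target."""
    if not 0.0 < p_single <= 1.0:
        raise ValueError("p_single must be in (0, 1]")
    if p_single == 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_single))


# Same per-step reliability, increasingly long chains.
for n_steps in (1, 5, 20):
    p = chain_success(0.9, n_steps)
    k = samples_needed(p, target=0.95)
    print(f"steps={n_steps:<3} single-attempt success={p:.3f} samples for 95% coverage={k}")
```

Under these assumptions, short problems are cheap to cover by sampling, while long chains drive the per-attempt success rate down geometrically. That is one way to read the claim that sequential methods earn their keep where intermediate steps accumulate.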
Where the question's 'memory breadth' framing gets interesting is the evidence that models lean on stored instances rather than genuine reasoning algorithms. Reasoning breakdowns track instance-level unfamiliarity, not task complexity: any chain succeeds if the model has seen similar instances, regardless of its length (see: Do language models fail at reasoning due to complexity or novelty?). That hints that the underlying engine is pattern memory, not depth, and that no amount of inference compute manufactures the missing algorithm. A related architectural finding shows hybrid systems that combine lookup memory with conditional computation beat computation alone at equal cost, and the gains show up most in reasoning and code, not pure retrieval (see: Can lookup memory and computation work together better than either alone?). Memory and compute aren't rivals; they're complementary axes.
The sharpest limit on the 'just add compute' instinct is that the training regime, not the inference budget, sets the ceiling. Non-reasoning models can't be made to catch up to reasoning models even with unlimited inference, because training instills a protocol that makes the extra tokens productive in the first place (see: Can non-reasoning models catch up with more compute?). Yet within that ceiling, inference compute genuinely substitutes for model size on hard prompts (see: Can inference compute replace scaling up model size?), and the framework you wrap around it matters less than the total budget and the quality of your value/reward function (see: Does the choice of reasoning framework actually matter for test-time performance?). The practical lesson the corpus keeps returning to is allocation: spend adaptively per prompt rather than uniformly, since easy problems waste a fixed budget and hard ones starve under it (see: How should we allocate compute budget at inference time?).
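The allocation point can be made concrete with a small sketch. It assumes each prompt has a known per-sample success probability and that samples are independent. In practice that probability would have to come from some difficulty estimator, which is not modeled here. The greedy rule and all names below are illustrative, not a method taken from the cited notes.

```python
import heapq


def coverage(p: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p) ** k


def uniform_allocation(probs: list[float], budget: int) -> list[int]:
    """Split the sample budget as evenly as possible across prompts."""
    base, extra = divmod(budget, len(probs))
    return [base + (1 if i < extra else 0) for i in range(len(probs))]


def adaptive_allocation(probs: list[float], budget: int) -> list[int]:
    """Give each sample to the prompt where it adds the most coverage.
    The marginal gain of the next sample on a prompt with k samples so far
    is p * (1 - p) ** k."""
    alloc = [0] * len(probs)
    heap = [(-p, i) for i, p in enumerate(probs)]  # negate for a max-heap
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        next_gain = probs[i] * (1.0 - probs[i]) ** alloc[i]
        heapq.heappush(heap, (-next_gain, i))
    return alloc


def expected_solved(probs: list[float], alloc: list[int]) -> float:
    """Expected number of prompts with at least one correct sample."""
    return sum(coverage(p, k) for p, k in zip(probs, alloc))


probs = [0.9, 0.9, 0.5, 0.1]  # hypothetical per-sample success rates
budget = 12
uniform = uniform_allocation(probs, budget)
adaptive = adaptive_allocation(probs, budget)
print("uniform ", uniform, round(expected_solved(probs, uniform), 3))
print("adaptive", adaptive, round(expected_solved(probs, adaptive), 3))
```

In this toy setting the greedy rule moves samples away from the easy prompts, whose coverage saturates after a couple of attempts, and toward the harder ones. A prompt with near-zero success probability would attract few samples under the same rule, so the sketch says nothing about prompts the model cannot solve at all.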
So the honest answer is that models don't simply 'excel at depth' or 'excel at breadth' as you scale. Most measurable test-time gains route through breadth (coverage, sampling, variance); depth pays off only on compositional tasks and only up to a threshold; and both are bounded by what training and stored instances already made possible. The thing you didn't know you wanted to know: when your model gets better with more thinking, the mechanism is often that it's buying more lottery tickets, not thinking more clearly. For the cleaner taxonomy underneath all this, the internal-vs-external split is the doorway (see: How do internal and external test-time scaling compare?).
Sources (10 notes)
Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.
Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.
GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.
Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.
Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.
Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.
Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.