INQUIRING LINE

Do models excel at reasoning depth or memory breadth when scaling test-time compute?

This explores whether spending more compute at inference time actually buys deeper reasoning, or whether it really buys breadth — wider sampling, more coverage, more shots at the answer — and the corpus suggests the line between the two is blurrier than the question assumes.


The question reads as a contest between 'thinking harder' (depth) and 'covering more ground' (breadth) when you scale test-time compute. The corpus's most striking move is to dissolve that contest. The cleanest version comes from work showing that extended thinking traces improve accuracy mainly by inflating output variance, since broader distributions cover the correct answer more often, rather than by reasoning better (see "Does extended thinking actually improve reasoning or just increase variance?"). Past a critical point the distribution becomes too diffuse and accuracy drops. In other words, much of what looks like depth is breadth wearing a longer chain of thought. That reframes the whole question: gains attributed to deeper reasoning may be coverage gains in disguise.
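
The rise-then-fall shape of the variance mechanism can be reproduced in a toy model. The sketch below is an illustration under loud assumptions, not the cited work's method: answers are treated as draws from a normal distribution whose mean is offset from the correct answer, and "longer thinking" is modeled only as a larger spread. The names `hit_probability`, `bias`, and `tolerance` are hypothetical.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def hit_probability(sigma: float, bias: float = 1.0, tolerance: float = 0.25) -> float:
    """Probability that one sampled answer lands within `tolerance` of the truth.

    Assumption (toy model): the correct answer sits at 0, while the model's
    answers are distributed as N(bias, sigma^2), so it is systematically off.
    """
    return normal_cdf(tolerance, bias, sigma) - normal_cdf(-tolerance, bias, sigma)


# Widening the spread stands in for "thinking longer" in this toy model.
spreads = [0.1, 0.3, 0.6, 1.0, 2.0, 4.0]
accuracies = [hit_probability(s) for s in spreads]

for s, a in zip(spreads, accuracies):
    print(f"sigma={s:<4} accuracy={a:.3f}")
```

With these assumed parameters, accuracy climbs as the distribution widens enough to reach the correct answer, then declines once the probability mass is spread too thin. Nothing in the toy model reasons better at any spread, which is the point of the coverage reading.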

The field formalizes the same tension as parallel-versus-sequential scaling, which several notes treat as the recurring axis of test-time compute (see "How should we balance parallel versus sequential compute at test time?"). Parallel methods (sampling many independent attempts) buy coverage; sequential methods (long compositional chains) buy depth. The optimal mix is a property of the task, not of the model: parallel wins for short independent problems, sequential for chains where intermediate steps accumulate. A complementary line argues you can even get depth-like benefits without paying depth's serial latency, by sampling parallel latent trajectories that explore the solution space at once (see "Can reasoning systems scale wider instead of only deeper?"). So 'breadth' keeps doing work that we instinctively credit to 'depth.'
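
The task dependence of the parallel-versus-sequential trade-off can be made concrete with two textbook probability formulas. This is a back-of-the-envelope sketch that assumes attempts and steps are statistically independent, which real model samples generally are not. The function names and the example probabilities are hypothetical, chosen only for illustration.

```python
def parallel_coverage(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts is correct,
    when each attempt succeeds with probability p."""
    return 1.0 - (1.0 - p) ** k


def chain_success(q: float, steps: int) -> float:
    """Chance that every step of a sequential chain is correct,
    when each step succeeds independently with probability q."""
    return q ** steps


# Short independent problem: a single shot is often right, so sampling pays.
short_problem = parallel_coverage(p=0.3, k=8)

# Compositional problem (assumed numbers): jumping straight to the answer
# almost never works, so eight parallel guesses still cover little ...
direct_guessing = parallel_coverage(p=0.01, k=8)
# ... while a ten-step chain with reliable steps does much better.
stepwise_chain = chain_success(q=0.95, steps=10)

print(f"short problem, 8 samples:    {short_problem:.3f}")
print(f"compositional, 8 guesses:    {direct_guessing:.3f}")
print(f"compositional, 10-step chain: {stepwise_chain:.3f}")
```

The same chain formula also shows why depth has a limit: per-step errors compound, so success probability shrinks as the chain grows longer.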

The question's 'memory breadth' framing gets interesting with the evidence that models lean on stored instances rather than genuine reasoning algorithms. Reasoning breakdowns track instance-level unfamiliarity, not task complexity: a chain of any length succeeds if the model has seen similar instances (see "Do language models fail at reasoning due to complexity or novelty?"). That hints the underlying engine is pattern memory, not depth, and no amount of inference compute manufactures the missing algorithm. A related architectural finding shows that hybrid systems combining lookup memory with conditional computation beat computation alone at equal cost, and the gains show up most in reasoning and code, not pure retrieval (see "Can lookup memory and computation work together better than either alone?"). Memory and compute aren't rivals; they're complementary axes.

The sharpest limit on the 'just add compute' instinct is that the training regime, not the inference budget, sets the ceiling. Non-reasoning models don't catch up to reasoning models at any inference budget, because training instills a protocol that makes the extra tokens productive in the first place (see "Can non-reasoning models catch up with more compute?"). Yet within that ceiling, inference compute genuinely substitutes for model size on hard prompts (see "Can inference compute replace scaling up model size?"), and the framework you wrap around it matters less than the total budget and the quality of your value/reward function (see "Does the choice of reasoning framework actually matter for test-time performance?"). The practical lesson the corpus keeps returning to is allocation: spend adaptively per prompt rather than uniformly, since easy problems waste a fixed budget and hard ones starve under it (see "How should we allocate compute budget at inference time?").
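
The allocation lesson can be sketched as a small greedy scheduler. This is a minimal illustration under strong assumptions, not a method taken from the cited notes: each prompt's single-sample success probability is assumed to be known in advance (in practice it would have to be estimated), samples are assumed independent, and the goal is to maximize the expected number of solved prompts. The names `expected_solved` and `allocate_adaptively` are hypothetical.

```python
import heapq
from typing import List, Sequence


def expected_solved(probs: Sequence[float], samples: Sequence[int]) -> float:
    """Expected number of prompts solved when prompt i gets samples[i] tries."""
    return sum(1.0 - (1.0 - p) ** k for p, k in zip(probs, samples))


def allocate_adaptively(probs: Sequence[float], budget: int) -> List[int]:
    """Hand out samples one at a time to the prompt with the largest marginal gain.

    The (k+1)-th sample on a prompt with success probability p adds
    p * (1 - p) ** k to the expected solved count, so gains shrink with k.
    """
    samples = [0] * len(probs)
    if not probs:
        return samples
    # Max-heap via negated gains; the index breaks ties deterministically.
    heap = [(-p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        samples[i] += 1
        next_gain = probs[i] * (1.0 - probs[i]) ** samples[i]
        heapq.heappush(heap, (-next_gain, i))
    return samples


# Assumed difficulty mix: two easy prompts, one medium, one hard.
probs = [0.9, 0.9, 0.2, 0.05]
budget = 16

uniform = [budget // len(probs)] * len(probs)
adaptive = allocate_adaptively(probs, budget)

print("uniform :", uniform, round(expected_solved(probs, uniform), 3))
print("adaptive:", adaptive, round(expected_solved(probs, adaptive), 3))
```

Under these assumptions the greedy schedule pulls samples away from the easy prompts, where extra attempts add almost nothing, and spends them where the marginal gain is still meaningful. A real system would need a difficulty estimator in place of the known probabilities used here.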

So the honest answer is that models don't simply 'excel at depth' or 'excel at breadth' as you scale. Most measurable test-time gains route through breadth (coverage, sampling, variance); depth pays off only on compositional tasks and only up to a threshold; and both are bounded by what training and stored instances already made possible. The thing you didn't know you wanted to know: when your model gets better with more thinking, the mechanism is often that it's buying more lottery tickets, not thinking more clearly. For the cleaner taxonomy underneath all this, the internal-versus-external split is the doorway (see "How do internal and external test-time scaling compare?").


Sources (10 notes)

Does extended thinking actually improve reasoning or just increase variance?

Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.

How should we balance parallel versus sequential compute at test time?

Parallel methods improve coverage; sequential methods enable depth. The optimal choice depends on task structure: parallel wins for independent short problems, sequential for compositional chains requiring intermediate accumulation.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

How should we allocate compute budget at inference time?

Research shows that dynamically adjusting inference compute per prompt—rather than using fixed budgets—improves performance and efficiency. Uniform spending wastes resources on easy problems while underserving hard ones.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-evaluating test-time compute trade-offs in LLMs. The question remains: *Do models excel at reasoning depth or memory breadth when scaling test-time compute?* Treat the following as dated claims (2024–2026) to be re-tested, not current ground truth.

What a curated library found — and when:
• Extended thinking traces improve accuracy mainly through output variance inflation (broader distributions), not reasoning quality; past a critical point, diffuseness drops accuracy (~2025).
• Parallel methods (independent sampling) buy coverage; sequential methods buy depth — optimal mix is task-dependent, not model-intrinsic (~2025).
• Reasoning breakdowns track instance-level unfamiliarity, not task complexity; chain length doesn't compensate for missing training patterns (~2026).
• Hybrid systems combining lookup memory with conditional computation beat compute-only approaches on reasoning and code tasks (~2026).
• Training protocol, not inference budget, sets the ceiling; non-reasoning models cannot match reasoning models even with unlimited test-time compute (~2025).
• Test-time compute substitutes for model size on hard prompts, but total reasoning budget and value/reward quality matter more than framework choice (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2506.04210 (2025-06) — Does Thinking More always Help?
• arXiv:2602.06176 (2026-02) — Large Language Model Reasoning Failures
• arXiv:2601.07372 (2026-01) — Conditional Memory via Scalable Lookup
• arXiv:2501.15602 (2025-01) — Rethinking External Slow-Thinking

Your task:
(1) **RE-TEST THE DEPTH–BREADTH SPLIT.** For each finding above, determine whether newer model scaling laws, training protocols (RL vs. SFT fine-grained separations), inference frameworks (multi-turn reflection, hierarchical memory), or evaluation benchmarks have since RELAXED the constraint that breadth dominates depth, or OVERTURNED the claim that training regime is immovable. Which limits still hold? Which have been dissolved?
(2) **Surface DISAGREEMENT.** Identify the strongest work from the last ~6 months that contradicts the "breadth-disguised-as-depth" thesis or argues training ceilings are more permeable than the library suggests. Flag any work showing depth-pure wins over sampling-only approaches.
(3) **Propose two forward questions:** (a) Can adaptive or learned routing between parallel-latent-trajectory and sequential-chain modes recover depth's efficiency without breadth's cost? (b) Does emergence of in-context abstraction learning (e.g., on-the-fly algorithm discovery) reframe memory vs. compute competition?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
