INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals · cross-cluster

What patterns emerge across test-time scaling and reasoning architectures?

This explores the recurring throughlines that connect two bodies of work: methods for spending more compute at inference time (test-time scaling) and the designs that make models reason. It also asks what those bodies of work, read side by side, reveal about each other. The first pattern is a clean organizing split: test-time scaling divides into *internal* methods (training a model to reason on its own) and *external* methods (search and verification bolted on at inference). They complement rather than compete: internal builds capability, external extracts performance from capability that already exists (see "How do internal and external test-time scaling compare?" and "How should test-time scaling methods be categorized and designed?"). The most interesting recent moves don't just turn the compute dial up; they change *when* compute happens (sleep-time precomputation, post-completion refinement, recursive language models), sidestepping the usual scaling tradeoffs.
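To make the external side of that split concrete, here is a minimal best-of-N sketch: sample several candidates, score each with a verifier, keep the best. It is an illustration under stated assumptions, not any particular paper's method. The names `best_of_n`, `toy_generate`, and `toy_score` are hypothetical, and the toy "model" just guesses a digit.

```python
import random
from typing import Callable, List


def best_of_n(generate: Callable[[random.Random], str],
              score: Callable[[str], float],
              n: int,
              seed: int = 0) -> str:
    """Sample n candidates and return the one the scorer ranks highest."""
    rng = random.Random(seed)
    candidates: List[str] = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)


# Hypothetical stand-ins for a real model and a real verifier.
def toy_generate(rng: random.Random) -> str:
    """Pretend model: guesses a single digit uniformly at random."""
    return str(rng.randint(0, 9))


def toy_score(answer: str) -> float:
    """Pretend verifier: rewards the (assumed) correct answer '7'."""
    return 1.0 if answer == "7" else 0.0


if __name__ == "__main__":
    # More samples means more inference compute spent on the same frozen model.
    print(best_of_n(toy_generate, toy_score, n=32))
```

Nothing about the generator changes as `n` grows, which is the sense in which external methods extract performance from capability that already exists.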

The second pattern is a surprising universality in the scaling curves themselves. Search budget in agentic deep-research systems follows the *same* diminishing-returns curve as reasoning tokens: every extra search step behaves like every extra chain-of-thought token (see "Does search budget scale like reasoning tokens for answer quality?", "How does search scale like reasoning in agent systems?", and "Do search steps follow the same scaling rules as reasoning tokens?"). This reframes search as just another inference-compute axis you can trade against reasoning. The same logic scales up to multi-agent systems, where roughly 80% of performance variance turns out to be a function of total tokens spent, not coordination cleverness (see "How does test-time scaling work at the agent level?"). And the framework you pick matters less than you'd think: best-of-N and tree search converge once you hold total compute constant, because errors snowball per step regardless of algorithm. What saves you is search scope and reward-function quality, not the name of the method (see "Does the choice of reasoning framework actually matter for test-time performance?").
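A toy calculation shows why diminishing returns and snowballing errors are the natural shapes here. It is not the curve measured in the notes. It assumes independent samples, a fixed per-sample success probability `p`, a perfect verifier, and independent per-step errors, all of which are simplifications introduced for illustration.

```python
def success_at_n(p: float, n: int) -> float:
    """P(at least one of n independent samples is correct)."""
    return 1.0 - (1.0 - p) ** n


def marginal_gain(p: float, n: int) -> float:
    """Extra success probability bought by the n-th sample."""
    return success_at_n(p, n) - success_at_n(p, n - 1)


def chain_survival(step_error: float, steps: int) -> float:
    """P(a chain of `steps` steps contains no error), assuming independent errors."""
    return (1.0 - step_error) ** steps


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(success_at_n(0.2, n), 3), round(marginal_gain(0.2, n), 3))
    for steps in (5, 10, 20, 40):
        print(steps, round(chain_survival(0.05, steps), 3))
```

Under these assumptions each additional sample buys less than the one before, and a longer chain is less likely to stay error-free whichever algorithm produced it, so a better per-step reward signal (lower `step_error`) matters more than the search method's name.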

But here's the tension that makes the two literatures speak to each other. Compute is not a free substitute for training. A non-reasoning model cannot catch a reasoning model no matter how much inference budget you hand it, because training instills a *protocol* that makes the extra tokens productive in the first place (see "Can non-reasoning models catch up with more compute?"). So the scaling curves only bite for models trained to climb them, which sends you straight into the critiques of what the reasoning actually is.

And those critiques are unsparing. Chain-of-thought degrades predictably the moment you step outside the training distribution: it imitates the *form* of reasoning without the underlying logic (see "Does chain-of-thought reasoning actually generalize beyond training data?" and "Why does chain-of-thought reasoning fail in predictable ways?"). The sharpest evidence: logically *invalid* chain-of-thought prompts perform nearly as well as valid ones, which means the structural shape of the reasoning, not its correctness, is doing the work (see "Does logical validity actually drive chain-of-thought gains?"). Reasoning models also fail in a way that looks less like running out of compute and more like disorganization: they wander down invalid paths and abandon promising ones too early, and cheap decoding-level nudges fix it without any retraining (see "Why do reasoning models abandon promising solution paths?").
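A decoding-level nudge of this kind can be sketched as a logit penalty on tokens that start a new line of thought, applied only while the current thought is still short. This is a rough illustration, not the cited work's exact formulation. The function name, the `penalty` and `min_thought_len` values, and the example switch token are all hypothetical.

```python
from typing import Dict, Set


def penalize_thought_switch(logits: Dict[str, float],
                            switch_tokens: Set[str],
                            tokens_since_switch: int,
                            penalty: float = 3.0,
                            min_thought_len: int = 50) -> Dict[str, float]:
    """Lower the logits of switch tokens while the current thought is still short.

    Once the thought has run for at least `min_thought_len` tokens, the logits
    are returned unchanged, so switching becomes freely available again.
    """
    if tokens_since_switch >= min_thought_len:
        return dict(logits)
    return {tok: (val - penalty if tok in switch_tokens else val)
            for tok, val in logits.items()}


if __name__ == "__main__":
    step_logits = {"Alternatively": 2.0, "so": 1.0}
    # Early in a thought: the switch token is discouraged.
    print(penalize_thought_switch(step_logits, {"Alternatively"}, tokens_since_switch=10))
    # Late in a thought: no adjustment.
    print(penalize_thought_switch(step_logits, {"Alternatively"}, tokens_since_switch=60))
```

The model's weights are untouched. Only the sampling distribution is reshaped at each step, which is why an intervention like this needs no fine-tuning.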

Put the patterns together and a single shape emerges. More compute reliably buys you more performance along several interchangeable axes — but only inside a model trained with a real reasoning protocol, and only up to the point where structure, not validity, runs out. The frontier in the corpus isn't 'scale harder'; it's changing *when* the compute lands and making the reasoning itself less of a pattern-matched imitation. That's the thing worth knowing: test-time scaling and reasoning architecture are the same problem viewed from two ends — one asks how much thinking, the other asks whether it's thinking at all.

Sources 12 notes

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

How should test-time scaling methods be categorized and designed?

Research identifies internal vs external as the primary taxonomic split for test-time scaling, with training-side constraints (policy entropy collapse) and novel directions that shift *when* compute happens (sleep-time, post-completion) rather than just *how much*. Methods like consensus games and recursive LMs sidestep traditional scaling tradeoffs.

Does search budget scale like reasoning tokens for answer quality?

Agentic deep research shows monotonic-to-diminishing-returns curves for search iterations, matching reasoning token scaling. This creates a new inference-compute axis: models can trade off reasoning budget against search budget to optimize answer quality.

How does search scale like reasoning in agent systems?

Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
