SYNTHESIS NOTE

Can non-reasoning models catch up with more compute?

Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

In verifier-free inference-time compute experiments (Think Deep, Think Fast), non-reasoning models fall substantially behind reasoning models even when given an extremely high inference budget. The gap does not close as the budget grows; it persists.

This sets a hard limit on the substitution explored in "Can inference compute replace scaling up model size?". The substitution works within a training regime, but not across training regimes. A standard instruction-tuned model with more inference compute cannot replicate what a model trained specifically for extended reasoning can do, even given equivalent token budgets.

Why? Reasoning models have internalized the reasoning process through training: they know how to use additional tokens productively. Non-reasoning models lack this structure, so additional tokens degrade into noise or verbosity rather than improved reasoning. The training regime instills the reasoning protocol that makes inference compute usable.

Qualification from targeted activation (Base Models paper): The gap is substantially closeable through targeted steering of base model activations without weight updates. A hybrid model that combines base model weights with the thinking model's deployment decisions recovers 91% of the performance gap while steering only 12% of tokens. This doesn't invalidate the finding, since non-reasoning models without steering still fall behind, but it significantly changes what "non-reasoning model" means in practice. If capability already exists latently and steering can surface it, the gap is about deployment mechanisms, not raw capability. See "Does RL teach reasoning or just when to use it?".
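A figure like "recovers 91% of the performance gap" is most naturally read as the share of the base-to-thinking score difference that the hybrid closes. The note does not state the exact formula, so the sketch below is an assumption about that reading, and the function name and the scores in the usage example are hypothetical, chosen only to make the arithmetic visible.

```python
def gap_recovered(base: float, hybrid: float, thinking: float) -> float:
    """Fraction of the base-to-thinking gap that the hybrid model closes.

    Assumed definition (not stated in the note):
        (hybrid - base) / (thinking - base)
    All three scores must be on the same benchmark and scale.
    """
    gap = thinking - base
    if gap <= 0:
        raise ValueError("thinking score must exceed base score")
    return (hybrid - base) / gap


if __name__ == "__main__":
    # Hypothetical accuracies, for illustration only.
    base_score, hybrid_score, thinking_score = 40.0, 76.4, 80.0
    share = gap_recovered(base_score, hybrid_score, thinking_score)
    print(f"gap recovered: {share:.0%}")
```

Under this reading, a high recovery share says the steered base model lands close to the thinking model, but it says nothing on its own about an unsteered model, which is why the original finding still stands.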

The imitation learning ceiling (Tutorial on LLM Reasoning): SFT/imitation learning creates an intelligence upper bound: the model is bounded by the quality of demonstrations it learns from, unable to surpass the skill level present in training data. RL + world models is the path beyond this ceiling, because RL allows discovery of strategies that exceed any individual demonstration. This provides the mechanism for why reasoning-specific training matters: it is not merely "more training" but training that enables exceeding the imitation ceiling.

This is a strong argument for the necessity of reasoning-specific post-training, not just inference-time tricks. Compute can amplify capability but cannot manufacture it. The dependency on training regime appears to be capability-specific. As "Can language models learn grammar from child-scale data?" shows, syntactic competence scales down readily and is achievable with human-scale data and the right composition. Reasoning capability requires the opposite: specialized training that instills the reasoning protocol itself. The lesson is not "you need a bigger model" but "you need the right training for the capability you want."
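The claim that compute amplifies capability without manufacturing it can be illustrated with majority voting, one common verifier-free way to spend extra inference budget. The sketch below is a toy model rather than the setup of the cited experiments: it assumes a two-answer task, independent samples, and a fixed per-sample probability of being correct, and the function name is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, n_samples: int) -> float:
    """Exact probability that a majority of n independent samples is correct.

    Toy assumptions: two candidate answers, each sample is independently
    correct with probability p_correct, and n_samples is odd (no ties).
    """
    if n_samples % 2 == 0:
        raise ValueError("use an odd n_samples to avoid ties")
    needed = n_samples // 2 + 1  # votes required for a strict majority
    return sum(
        comb(n_samples, k) * p_correct**k * (1 - p_correct) ** (n_samples - k)
        for k in range(needed, n_samples + 1)
    )


if __name__ == "__main__":
    # Compare a model that is usually right with one that is usually wrong.
    for p in (0.6, 0.4):
        for n in (1, 11, 101):
            acc = majority_vote_accuracy(p, n)
            print(f"p_correct={p} samples={n} majority accuracy={acc:.3f}")
```

In this toy setting, more samples push accuracy toward 1 when the correct answer is already the more likely one and toward 0 when it is not. Extra budget sharpens whatever the model already tends to do, which matches the note's point that the training regime has to supply the capability first.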

Inquiring lines that read this note 182

This note is a source for the research framings below, grouped by the broader line of inquiry each explores.

Why do reasoning models fail at systematic problem-solving and search?

Can self-supervised signals enable process supervision without human annotation?

When does architectural design matter more than raw model capacity?

What structural factors drive popularity bias in recommendation systems?

Can likelihood choice matter more than architectural depth for CF?

Do base models contain latent reasoning that training can unlock?

How does latent reasoning compare to verbalized chain-of-thought?

How should inference compute be adaptively allocated based on prompt difficulty?

How do knowledge graphs enable efficient multi-hop reasoning over alternatives?

Can the structure-routing principle apply beyond RAG to other AI reasoning systems?

Can model routing outperform monolithic scaling as an efficiency strategy?

How does example difficulty affect learning efficiency in language models?

Can inference-time compute substitute for scaling up model parameters?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

What structural advantages do diffusion language models offer over autoregressive methods?

How do knowledge injection methods compare across cost and effectiveness?

Why do self-improving systems struggle without clear external performance metrics?

How should models express uncertainty rather than forced confident answers?

How does uncertainty estimation drive computational resource allocation in models?

Does reinforcement learning teach reasoning or just when to reason?

How do training data properties shape reasoning capability development?

What capability tradeoffs emerge when scaling model reasoning abilities?

How does test-time aggregation affect reasoning correctness and reliability?

Does decoupling planning from execution improve multi-step reasoning accuracy?

Does architectural design matter more than model scale for reasoning tasks?

Do autonomous architecture discoveries follow predictable scaling laws?

Can multi-agent reasoning systems scale beyond current architectures?

How do adversarial and manipulative prompts attack reasoning models?

Which computational strategies best support reasoning in language models?

Does recurrence enable reasoning capabilities that fixed-depth transformers cannot achieve?

Can next-token prediction alone produce genuine language understanding?

How does policy entropy collapse constrain reasoning-focused reinforcement learning?

How should retrieval systems optimize for multi-step reasoning during inference?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

How do game-based benchmarks reveal reasoning fragmentation across domains?

Do language models perform faithful symbolic reasoning independent of semantic grounding?

Why do recursive belief models require different training than logical derivation?

Do corrupted reasoning traces serve as effective supervision signals?

Can alternative training methods improve on supervised fine-tuning for language models?

How do inference-time reward methods compare to per-user fine-tuning?

How should iterative research systems allocate reasoning per search step?

Why does finetuning cause catastrophic forgetting of model capabilities?

Does sparse parameter updating improve test-time training's computational cost?

Why do benchmark improvements fail to reflect actual reasoning quality?

When do additional thinking tokens stop improving reasoning performance?

How can AI systems learn from failures without cascading errors?

How should token budgets be set to prevent runaway oscillation during inference?

How do training priors constrain what context information can override?

Can goal information injected at inference time replace goal-conditioned training?

Why does verification consistently lag behind AI generation?

How does sequence length affect sparsity tolerance in models?

What pretraining choices and baseline capability constrain reinforcement learning gains?

Why does reinforcement learning suppress output diversity compared to supervised fine-tuning?

Can decoding-time prompting strategies fully replace diversity-focused training methods?

How do we evaluate AI systems when user perception misleads actual performance?

How should evaluation frameworks account for the computational cost of frontier AI capability?

Does fine-tuning modify underlying model capabilities or only behavioral outputs?

Can ensemble predictions be distilled back into a single deployable model?

Related concepts in this collection 7

Can inference compute replace scaling up model size? Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
the limit of this substitution
How do internal and external test-time scaling compare? Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
internal TTS addresses this gap through training
What makes test-time training actually work in practice? Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
TTT is the bridge: it updates parameters at test time, potentially narrowing the gap between training regimes without full retraining
Can language models learn grammar from child-scale data? If models trained on ~100 million words—roughly what children experience—can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
contrast: syntactic competence doesn't require specialized training; reasoning does — reveals that training-regime dependencies are capability-specific
Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
tension: both claim training matters, but differ on whether reasoning is created or activated. This note says reasoning models have internalized something non-reasoning models lack; the latent-capability finding shows base models already contain reasoning behaviors that minimal signals (steering, decoding tweaks) unlock. The gap may be activation, not capability.
Does RLVR actually expand what models can reason about? Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
refines the claim: pass@k analysis shows RLVR reasoning models do not actually exceed their base model at high k — they just sample more efficiently at low k; the "non-reasoning vs reasoning" gap may be partly a sampling-efficiency gap that high-k aggregation reveals
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
reframes the mechanism: the gap is not that reasoning models "internalize the reasoning protocol" but that they have learned *when* to activate latent reasoning circuitry; non-reasoning models have the circuitry but lack the activation policy
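The pass@k comparison mentioned in the RLVR entry above can be made concrete with the standard unbiased estimator used in sampling-based evaluation: draw n samples per problem, count the c correct ones, and compute 1 - C(n-c, k) / C(n, k). Whether the cited analysis used exactly this estimator is an assumption, and the sample counts in the usage example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: number of correct samples, k: budget.
    Returns the probability that at least one of k samples drawn
    without replacement from the n is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when k > n - c, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    n = 200
    # Hypothetical counts: an RL-tuned model that is often right per sample,
    # and a base model that is rarely right per sample.
    for name, correct in (("rl_tuned", 60), ("base", 6)):
        for k in (1, 10, 100):
            print(f"{name:8s} pass@{k:<3d} = {pass_at_k(n, correct, k):.3f}")
```

With counts like these, the two models are far apart at k = 1 and much closer at large k, which is the pattern the entry describes as a sampling-efficiency gap rather than a difference in what the model can solve at all.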

Original note title

non-reasoning models cannot match reasoning models even with unlimited inference budget
