Can non-reasoning models catch up with more compute?
Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
In verifier-free inference-time compute experiments (Think Deep, Think Fast), non-reasoning models fall substantially behind reasoning models even when given an extremely high inference budget. The gap does not close as compute increases; it persists.
This sets a hard limit on the substitution explored in "Can inference compute replace scaling up model size?". The substitution works within a training regime, but not across training regimes. A standard instruction-tuned model with more inference compute cannot replicate what a model trained specifically for extended reasoning can do, even given equivalent token budgets.
Why? Reasoning models have internalized the reasoning process through training — they know how to use additional tokens productively. Non-reasoning models don't have this structure, so additional tokens degrade into noise or verbosity rather than improved reasoning. The training regime instills the reasoning protocol that makes inference compute usable.
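To make the mechanism concrete, consider majority voting over sampled answers, one common verifier-free way to spend extra inference budget. The sketch below is a toy simulation, not the setup used in Think Deep, Think Fast: the function `majority_vote_accuracy` and the two answer distributions are illustrative assumptions. It compares a model whose most likely answer is already correct with one whose most likely answer is wrong.

```python
import random
from collections import Counter


def majority_vote_accuracy(answer_probs, correct, n_samples, trials=2000, seed=0):
    """Estimate how often the most frequent of n_samples sampled answers is correct.

    answer_probs maps each candidate answer to the (assumed) probability that
    the model emits it on a single sample.
    """
    rng = random.Random(seed)
    answers = list(answer_probs)
    weights = [answer_probs[a] for a in answers]
    hits = 0
    for _ in range(trials):
        samples = rng.choices(answers, weights=weights, k=n_samples)
        # most_common breaks ties by first appearance in the sample list.
        winner, _count = Counter(samples).most_common(1)[0]
        hits += winner == correct
    return hits / trials


# Hypothetical single-sample answer distributions; "A" is the correct answer.
correct_is_modal = {"A": 0.4, "B": 0.3, "C": 0.3}   # capability present but noisy
correct_not_modal = {"A": 0.3, "B": 0.5, "C": 0.2}  # model favors a wrong answer

for name, dist in [("modal", correct_is_modal), ("not modal", correct_not_modal)]:
    for n in (1, 11, 101):
        acc = majority_vote_accuracy(dist, "A", n)
        print(f"correct answer {name}, {n} samples: accuracy ~ {acc:.2f}")
```

Under these assumed distributions, voting sharpens whatever answer the model already favors. That is one way to read the claim that extra tokens are only usable when training has put the right behavior in place.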
Qualification from targeted activation (Base Models paper): The gap is substantially closeable through targeted steering of base model activations without weight updates. A hybrid model using base model weights + thinking model deployment decisions recovers 91% of the performance gap while steering only 12% of tokens. This doesn't invalidate the finding — non-reasoning models without steering still fall behind — but it significantly changes what "non-reasoning model" means in practice. If capability already exists latently and steering can surface it, the gap is about deployment mechanisms, not raw capability. See Does RL teach reasoning or just when to use it?.
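The "91% of the performance gap" figure can be read as a normalized recovery ratio. The sketch below assumes that definition: the hybrid's improvement over the base model divided by the thinking model's improvement over the base model. The function name `gap_recovered` and the benchmark scores are hypothetical and are not taken from the paper.

```python
def gap_recovered(base_score, hybrid_score, thinking_score):
    """Fraction of the base-to-thinking performance gap closed by the hybrid.

    Assumed definition: (hybrid - base) / (thinking - base).
    """
    gap = thinking_score - base_score
    if gap <= 0:
        raise ValueError("thinking_score must exceed base_score for a gap to exist")
    return (hybrid_score - base_score) / gap


# Hypothetical accuracies (percent) chosen only to illustrate the arithmetic.
base, hybrid, thinking = 40.0, 76.4, 80.0
print(f"gap recovered: {gap_recovered(base, hybrid, thinking):.0%}")
```

On this reading, a high recovery ratio at a low steering rate says the remaining gap is small relative to what the base weights already support.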
The imitation learning ceiling (Tutorial on LLM Reasoning): SFT/imitation learning creates an intelligence upper bound: the model is bounded by the quality of demonstrations it learns from, unable to surpass the skill level present in training data. RL + world models is the path beyond this ceiling, because RL allows discovery of strategies that exceed any individual demonstration. This provides the mechanism for why reasoning-specific training matters: it is not merely "more training" but training that enables exceeding the imitation ceiling.
This is a strong argument for the necessity of reasoning-specific post-training, not just inference-time tricks. Compute can amplify capability but cannot manufacture it. The dependency on training regime appears to be capability-specific. As "Can language models learn grammar from child-scale data?" shows, syntactic competence scales down readily and is achievable with human-scale data and the right composition. Reasoning capability requires the opposite: specialized training that instills the reasoning protocol itself. The lesson is not "you need a bigger model" but "you need the right training for the capability you want."
Inquiring lines that use this note as a source (162)
This note is a source for these synthesized inquiries.
- What design changes could make constraint inference more reliable without explicit cuing?
- Can explicit goal state scaffolding at inference time transfer to autonomous tracking through training?
- How do larger models maintain more parallel tasks than smaller models?
- Can likelihood choice matter more than architectural depth for CF?
- Can latent reasoning architectures work as retrofits to existing models?
- How does step-level compute allocation compare to response-level thinking?
- How should we allocate compute between reasoning and retrieval iterations?
- Can the structure-routing principle apply beyond RAG to other AI reasoning systems?
- Can model routing and compute allocation work together as independent optimizations?
- How do byte-level models allocate compute without explicit difficulty estimators?
- Does test-time compute actually substitute for having larger model parameters?
- What advantages emerge from running 13 times more parallel reasoning chains with the same budget?
- How does inference compute substitution affect the training parameter scaling trade-off?
- Can adaptive prompt-difficulty allocation compound with architectural efficiency improvements?
- How do sub-token and architecture-level compute optimization strategies compare?
- Can offline context optimization reduce test-time latency like sleep-time compute?
- Can budget-tightening curricula improve reasoning efficiency more than fixed budgets?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- Does the DeepSeek R1 single token insertion represent genuine reasoning?
- Can sequential computation through depth solve problems that parallel width cannot?
- Can architecture changes and early stopping combine to close the diffusion inference gap?
- How do training-time and inference-time knowledge injection techniques compare?
- How does iteration cycle time constrain autonomous research budgets?
- How does uncertainty estimation drive computational resource allocation in models?
- Does inference-time compute scaling require explicit reasoning traces or verifiable rewards?
- Can energy minimization replace reasoning-specific reinforcement learning for system 2 thinking?
- How do gradient descent iterations at inference compare to chain-of-thought reasoning chains?
- Can smaller models actually perform well on specific downstream tasks?
- Why do open-source models trained on proprietary outputs still fail at reasoning?
- Why does joint optimization of prompts and inference strategy outperform separate tuning?
- Can parallel thinking outperform sequential thinking under the same token budget?
- Can adaptive compute distribution across prompts replace the need for sophisticated reasoning frameworks?
- Why do non-reasoning models work better under extreme decomposition than reasoning models?
- Why do models fail on logically equivalent tasks with different data distributions?
- How does test-time compute substitute for model parameter scaling?
- Does more inference compute help reasoning models match specialized domain performance?
- How does training-time voting differ from inference-time majority voting over samples?
- How should compute budgets be allocated across multi-stage RAG architectures?
- Can test-time compute on smaller models replace larger model inference?
- Does architectural design matter more than model scale for reasoning tasks?
- What mechanisms drive test-time compute allocation in reasoning tasks?
- Can multi-agent reasoning systems scale beyond current architectures?
- How should inference budget adapt based on problem difficulty?
- Can smaller specialist models outperform large generalist models on domain tasks?
- Does test-time compute scaling work for agentic deep research tasks?
- Can minimal adversarial triggers disrupt reasoning across multiple unrelated queries?
- Does model scaling improve knowledge storage faster than reasoning ability?
- Do task-specific heuristics emerge because they compress well enough?
- How does per-token adaptive compute improve efficiency in recurrent reasoning?
- Does trading model size for inference steps improve overall efficiency scaling?
- Can breadth-first search in continuous space outperform chain-of-thought on logical tasks?
- Could graph neural networks fundamentally outperform transformers on structured reasoning?
- Do decoder-only models have inherent architectural limits for non-sequential information?
- How much does inference budget improve self-generated search performance?
- How does test-time scaling relate to token budget in agentic deep research?
- Can any practitioner apply multi-token prediction without massive compute?
- Do models excel at reasoning depth or memory breadth when scaling test time compute?
- What makes reasoning-specific post-training different from standard parameter scaling?
- How should inference-time token budgets vary across models of different capability levels?
- What distinguishes training-time entropy collapse from test-time variance inflation?
- How can inference-time retrieval avoid the domain boundary problem?
- Can any architecture fundamentally solve problems that require inherently sequential computation?
- Does deep-thinking ratio measure computational effort better than chain-of-thought length?
- How much inference efficiency do we gain by eliminating self-correction passes?
- Can post-thinking compute on memory reduce query-time reasoning costs?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- How much do structural inductive biases matter compared to training data volume?
- Why does inference-time thinking hurt proactive critical thinking in vanilla models?
- Can compute-optimal scaling work without co-optimizing the prompt itself?
- How should token budgets be allocated when prompt-inference coupling matters?
- What limits exist on retrieval budget during inference?
- Can test-time compute allocation shift from solutions to strategies?
- Can models compress reasoning chains without external teacher supervision?
- How does constraint complexity relate to optimal reasoning token budgets?
- How does policy entropy during training affect search discipline during inference?
- How do game-based benchmarks reveal reasoning fragmentation across domains?
- Why do recursive belief models require different training than logical derivation?
- Can models trained on longer contexts develop better fundamental reasoning abilities?
- How does post-training on traces improve performance without semantic reasoning?
- How much does test-time compute improve reasoning without more tokens?
- How do inference-time reward methods compare to per-user fine-tuning?
- Why do long-horizon reasoning tasks need per-turn step limits rather than just compute budgets?
- Can optimization algorithms exploit the shift between procedural and planning bottlenecks?
- Do higher asymptote recipes unlock genuinely novel reasoning strategies?
- What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?
- Why does parallel sampling fail on graph connectivity tasks?
- How does task structure determine optimal test-time compute allocation?
- Can bounded-depth transformers solve inherently sequential problems?
- How should inference compute budget be allocated across different prompt difficulties?
- Can inference budgets be allocated differently based on prompt difficulty?
- Could deploying GPT-4 for everyone require 100 million specialized chips?
- How should inference budgets adapt based on prompt difficulty?
- Where does inference compute stop substituting for model capacity?
- Can compute allocation and model routing be combined for better results?
- What makes routing a better investment than training larger models?
- How should timing for reasoning intervention be determined during inference?
- Can test-time voting improve reasoning beyond the base model's original capabilities?
- Does sparse parameter updating improve test-time training's computational cost?
- Does inference-time compute improve pretraining data efficiency in practice?
- How much reasoning depth do we actually need for most real-world tasks?
- Can weaker models match stronger ones with sufficient search and reasoning budget?
- Why do AI benchmarks measure accuracy instead of reasoning quality?
- How does reasoning accuracy degrade when token budgets exceed critical thresholds?
- Why does more inference compute amplify wandering rather than solving it?
- How should token budgets be set to prevent runaway oscillation during inference?
- Can a model be strong at MMLU but weak at long-horizon tasks?
- Why do reasoning models fail to improve constrained optimization performance?
- Do base models contain latent reasoning that minimal training can unlock?
- Can activation steering vectors compress reasoning without retraining models?
- Can test-time compute budgets be allocated differently per query difficulty?
- Does latent reasoning capability exist in base models before any training?
- Can goal information injected at inference time replace goal-conditioned training?
- Does decoupling reasoning reduce inference cost more than sequential scaling?
- Can memory and test-time compute scale together as a single axis?
- How does test-time verification decouple the act of checking from reasoning generation?
- Why can generative verifiers scale verification compute more effectively than fixed-output discriminative models?
- What makes inference budgets allocate adaptively per prompt difficulty?
- Can a single model implement fast thinking, slow thinking, and tool use?
- Why does second-hop reasoning fail when composed with out-of-distribution triples?
- Should production deployments scale budgets with sequence length for sparse models?
- What limits external scaling when a model lacks reasoning foundation?
- Can models reason at inference without specialized internal training?
- Why does parallel sampling become more efficient when reasoning branches are memoryless?
- Why do hybrid memory and compute sparsity outperform pure parameter scaling?
- Can smaller amounts of diverse reasoning demonstrations replace exhaustive factual training data?
- Can compute budget scaling replace annotation budget in process supervision training?
- How much training data is truly necessary to unlock latent model reasoning?
- Can energy-based transformers achieve deep reasoning without supervision?
- Why does parallel thinking outperform sequential thinking under fixed token budgets?
- Can sleep-time compute reduce latency demands during model inference?
- What inference-time scaling benefits emerge from reasoning before each prediction?
- Can activation steering compress reasoning without retraining models?
- How much do compressed reasoning traces transfer across different models?
- Can distillation from stronger models create genuinely new reasoning abilities?
- What does pass@k reveal about base model reasoning capacity?
- How should benchmark design account for task-dependent sparsity tolerance differences?
- Does sparsity-guided ordering work equally well for reasoning and classification tasks?
- Why does target probability matter more than task logical complexity?
- Can test-time compute fully replace scaling model parameters on hard problems?
- How does spending offline compute affect wake-time prediction latency?
- What computational structures can actually scale serial reasoning depth?
- Can standard next-token prediction capture complex multi-step human reasoning directly?
- Can base models spontaneously produce reasoning traces without any RL training?
- How do sparse parameter updates enable when-not-how training to work?
- Can single-problem fine-tuning match full RL pipeline reasoning gains?
- Can inference budgets be allocated adaptively based on prompt difficulty?
- Should prompt design and inference scaling be optimized together or separately?
- Does policy entropy collapse prevent inference-time search from finding solutions?
- Is reasoning failure caused by task complexity or training distribution gaps?
- Can test-time compute scaling substitute for larger model parameters?
- What architectural variables most improve inference efficiency today?
- Where does the generation-verification gap appear in test-time compute?
- Are reasoning models more vulnerable to adversarial manipulation than standard models?
- How do search and reasoning workflows improve forecasting performance over base models?
- Why does reasoning backward enable better forward reasoning performance?
- Does task diversity in pretraining data transfer reasoning better than larger models?
- Can a two-layer network outgeneralize billion-parameter models through recursion alone?
- Can decoding-time prompting strategies fully replace diversity-focused training methods?
- Can scaling data alone solve performance gaps on long-tail concepts?
- How does the inference steps dial compare to test-time compute trade-offs in language models?
- How should evaluation frameworks account for the computational cost of frontier AI capability?
- Why does architecture matter more than training compute for inference efficiency?
Related concepts in this collection (7)
- Can inference compute replace scaling up model size?
  Explores whether smaller models given more thinking time during inference can match larger models. Matters because it reshapes deployment economics and compute allocation strategies.
  the limit of this substitution
- How do internal and external test-time scaling compare?
  Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
  internal TTS addresses this gap through training
- What makes test-time training actually work in practice?
  Test-time training achieved striking gains on ARC tasks, but which components are truly essential? This explores what happens when you remove each of the three key ingredients.
  TTT is the bridge: it updates parameters at test time, potentially narrowing the gap between training regimes without full retraining
- Can language models learn grammar from child-scale data?
  If models trained on ~100 million words, roughly what children experience, can match human syntactic performance, what does that tell us about what data volume is actually necessary for learning grammar?
  contrast: syntactic competence doesn't require specialized training; reasoning does, which reveals that training-regime dependencies are capability-specific
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  tension with this note: both claim training matters, but they differ on whether reasoning is created or activated. This note says reasoning models have internalized something non-reasoning models lack; the latent-capability finding shows base models already contain reasoning behaviors that minimal signals (steering, decoding tweaks) unlock, so the gap may be one of activation, not capability
- Does RLVR actually expand what models can reason about?
  Explores whether reinforcement learning from verifiable rewards teaches models genuinely new reasoning skills or simply makes existing capabilities more reliable. Pass@k analysis suggests the latter.
  refines the claim: pass@k analysis shows RLVR reasoning models do not actually exceed their base model at high k; they just sample more efficiently at low k. The "non-reasoning vs reasoning" gap may be partly a sampling-efficiency gap that high-k aggregation reveals
- Does RL post-training create reasoning or just deploy it?
  Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
  reframes the mechanism: the gap is not that reasoning models "internalize the reasoning protocol" but that they have learned *when* to activate latent reasoning circuitry; non-reasoning models have the circuitry but lack the activation policy
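
The pass@k analysis mentioned in the list above can be sketched with the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn for a problem and c of them are correct. The sample counts below are hypothetical and only illustrate how a large gap at low k can vanish at high k; they are not results from any paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples is correct,
    given that c out of n drawn samples were correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for one problem: correct samples out of n = 100.
n = 100
base_correct, rl_correct = 2, 30

for k in (1, 10, 100):
    base = pass_at_k(n, base_correct, k)
    tuned = pass_at_k(n, rl_correct, k)
    print(f"k={k:>3}: base pass@k = {base:.3f}, RL-tuned pass@k = {tuned:.3f}")
```

With counts like these, the tuned model is far ahead at k = 1 while both reach the same ceiling once k is large, which is the sampling-efficiency reading of the gap.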
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Reasoning Models Can Be Effective Without Thinking
- ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- On the Reasoning Capacity of AI Models and How to Quantify It
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
- Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
Original note title
non-reasoning models cannot match reasoning models even with unlimited inference budget