SYNTHESIS NOTE

How do internal and external test-time scaling compare?

Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.

Synthesis note · 2026-02-20 · sourced from Test Time Compute

In the standard taxonomy, every test-time scaling (TTS) approach belongs to one of two categories:

Internal TTS: Train the model so it generates long chain-of-thought (CoT) reasoning autonomously, without external scaffolding. Requires supervised fine-tuning (SFT) on long CoT data, reinforcement learning (RL) to reinforce reasoning, or test-time training (TTT, parameter updates at inference). The model self-organizes compute allocation. Examples: o1, DeepSeek-R1, QwQ.
External TTS: Use inference-time infrastructure (search algorithms, verifiers, reward models) to steer a base model toward better outputs. The model's parameters are unchanged; compute is spent on search and evaluation. Examples: Best-of-N with a process reward model (PRM), Monte Carlo tree search (MCTS), beam search, majority voting.
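
As a rough illustration of the external category, the sketch below implements majority voting over samples from a frozen model. The names `sample_answer` and `make_toy_sampler` are hypothetical stand-ins for a real model call, not part of any library; the point is only that no weights change and all extra compute goes into drawing and comparing samples.

```python
import random
from collections import Counter
from typing import Callable


def majority_vote(sample_answer: Callable[[str], str], prompt: str, n: int) -> str:
    """Draw n independent answers from a frozen model and return the most common one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    answers = [sample_answer(prompt) for _ in range(n)]
    # most_common(1) returns the top (answer, count) pair; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_toy_sampler(seed: int = 0, p_correct: float = 0.6) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: answers '42' with probability p_correct."""
    rng = random.Random(seed)

    def sample(prompt: str) -> str:
        if rng.random() < p_correct:
            return "42"
        return rng.choice(["41", "43", "44"])

    return sample


if __name__ == "__main__":
    sampler = make_toy_sampler(seed=0, p_correct=0.6)
    # Raising n spends more inference compute on the same, unchanged model.
    for n in (1, 5, 25):
        print(n, majority_vote(sampler, "What is 6 * 7?", n))
```

The only compute dial here is `n`: the sampler itself is identical across runs, which is what distinguishes this from internal TTS.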

Internal and external TTS are complementary, not competing: internal TTS makes models better reasoners; external TTS extracts more performance from whatever reasoning capability exists. Combining them (e.g., using Best-of-N to boost a long-CoT model with a PRM) often outperforms either alone.
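
The combination mentioned above can be sketched as Best-of-N selection over reasoning chains scored by a PRM. This is a minimal sketch under stated assumptions: `generate_chain` stands in for a long-CoT model, `score_steps` stands in for a PRM that returns one score per step, and aggregating step scores with `min` is one common choice rather than the method of any particular paper. All three names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Chain = List[str]  # one reasoning chain: a list of step strings


def chain_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores; assumption: a chain is only as good as its weakest step."""
    return min(step_scores, default=float("-inf"))


def best_of_n(
    generate_chain: Callable[[str], Chain],
    score_steps: Callable[[str, Chain], Sequence[float]],
    prompt: str,
    n: int,
) -> Tuple[Chain, float]:
    """Sample n chains from the (internally scaled) model, keep the one the PRM rates highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate_chain(prompt) for _ in range(n)]
    scored = [(chain_score(score_steps(prompt, c)), c) for c in candidates]
    # max keeps the first candidate among equal scores.
    best_score, best_chain = max(scored, key=lambda pair: pair[0])
    return best_chain, best_score
```

The generator supplies the internal part (whatever reasoning the model was trained to produce), and the selection loop supplies the external part; swapping in a weaker generator leaves the loop unchanged but lowers the ceiling on what it can select.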

The practical distinction matters for deployment: internal scaling is a training cost paid once; external scaling is an inference cost paid per query. At high query volume the economics favor internal scaling, because the one-time training cost is amortized across every query, but external scaling remains essential during development, when training is expensive.
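
The pay-once versus pay-per-query trade-off can be made concrete with a break-even calculation. The sketch below assumes a deliberately simple linear cost model that is not taken from the source: internal scaling costs `train_cost` once plus `internal_cost_per_query`, and external scaling costs `external_cost_per_query` with no upfront cost. All parameter names and figures are hypothetical.

```python
import math
from typing import Optional


def break_even_queries(
    train_cost: float,
    internal_cost_per_query: float,
    external_cost_per_query: float,
) -> Optional[int]:
    """Smallest query count at which internal scaling is no more expensive than external.

    Returns None when external scaling is never more expensive per query,
    so the upfront training cost is never recovered.
    """
    saving_per_query = external_cost_per_query - internal_cost_per_query
    if saving_per_query <= 0:
        return None
    return math.ceil(train_cost / saving_per_query)


if __name__ == "__main__":
    # Hypothetical figures in arbitrary compute units.
    print(break_even_queries(1_000_000, 2, 10))  # → 125000
```

Below the break-even volume the external route is cheaper overall, which matches the observation that it stays useful during development.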

The finding in "Can non-reasoning models catch up with more compute?" illustrates the limits of external TTS alone: you need the internal foundation before external scaling can amplify it.

Inquiring lines that read this note 38

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

Can model routing outperform monolithic scaling as an efficiency strategy?

How do routing and test-time compute scaling work together as optimization axes?

Can inference-time compute substitute for scaling up model parameters?

Does parallel reasoning outperform sequential thinking under fixed compute budgets?

When does architectural design matter more than raw model capacity?

How do sub-token and architecture-level compute optimization strategies compare?

What actually drives chain-of-thought reasoning improvements in language models?

How does the three-component definition apply to test-time scaling laws?

What properties determine whether reward signals teach genuine reasoning?

How does reward function accuracy affect the efficiency of test-time compute allocation?

Do autonomous architecture discoveries follow predictable scaling laws?

How do conditional scaling laws incorporate hardware into architecture choices?

How do knowledge injection methods compare across cost and effectiveness?

What are the computational trade-offs between training-time vs inference-time consistency correction?

How should inference compute be adaptively allocated based on prompt difficulty?

Can test-time compute allocation shift from solutions to strategies?

What capability tradeoffs emerge when scaling model reasoning abilities?

What test-time strategies did o3 discover without human specification?

How does example difficulty affect learning efficiency in language models?

How much task-similar finetuning data does test-time training actually need?

Why does finetuning cause catastrophic forgetting of model capabilities?

Does sparse parameter updating improve test-time training's computational cost?

What drives capability and cost efficiency in agent systems?

When is 15x token overhead actually worth the compute cost?

Why does verification consistently lag behind AI generation?

Does decoupling planning from execution improve multi-step reasoning accuracy?

How do external invocation latencies drive technique convergence?

How can identical external performance mask different internal representations?

How do coverage and identifiability set separate performance ceilings?

Related concepts in this collection 7

Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
the limit of external TTS without internal foundation
How should we balance parallel versus sequential compute at test time? Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
a cross-cutting axis that applies within each category
Can retrieval be extended into multi-step chains like reasoning? Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?
CoRAG is a hybrid that escapes the internal/external binary: training teaches chain generation (internal) while compute dials (chain length/count) are applied at inference (external); retrieval-intensive tasks have their own TTS curve that this taxonomy did not originally capture
Can models precompute answers before users ask questions? Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
sleep-time compute fractures the dichotomy by adding a third temporal position: pre-interaction compute is neither internal (weights trained) nor external (inference-time search) but amortized pre-computation; the binary taxonomy needs a third category
Can models reason without generating visible thinking tokens? Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
challenges the taxonomy: latent recurrent depth-scaling is internal (architectural recurrence) but applied at inference (external compute dial), occupying a hybrid position the binary did not anticipate; verbalization is orthogonal to the internal/external split
Does RL post-training create reasoning or just deploy it? Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
reframes "internal TTS": if RL teaches *when* to activate latent capability rather than how to reason, then "internal TTS" is more accurately deployment-timing optimization than capability instillation; the foundation that external TTS amplifies was already in the base model
Can modular cognitive tools unlock reasoning without training? Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations—understanding, recalling, examining, and backtracking—rather than through reinforcement learning?
third-category instance: cognitive tools elicit reasoning at inference time without weight updates AND without external search infrastructure — neither internal nor external in the original sense; the taxonomy needs to distinguish "trained to reason" from "scaffolded to reason"

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

internal vs external tts is the primary taxonomic split in test-time scaling research
