How do internal and external test-time scaling compare?
Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
In this taxonomy, every test-time scaling (TTS) approach belongs to one of two categories:
- Internal TTS: Train the model so it generates long chain-of-thought (CoT) reasoning autonomously, without external scaffolding. Requires supervised fine-tuning (SFT) on long CoT data, reinforcement learning (RL) to reinforce reasoning, or test-time training (TTT, i.e. parameter updates at inference). The model decides for itself how much compute to allocate. Examples: o1, DeepSeek-R1, QwQ.
- External TTS: Use inference-time infrastructure (search algorithms, verifiers, reward models) to steer a base model toward better outputs. The model's parameters are unchanged; compute is spent on search and evaluation. Examples: Best-of-N with a process reward model (PRM), Monte Carlo tree search (MCTS), beam search, majority voting.
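The two simplest external strategies named above, majority voting and Best-of-N with a verifier, can be sketched in a few lines. This is a minimal illustration, not any particular system's implementation: `toy_generate` and `toy_score` are hypothetical stand-ins for a sampled model call and a reward model, and the fixed seed is only there to make the sketch repeatable.

```python
import random
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the verifier scores highest."""
    return max(candidates, key=score)


def sample_candidates(
    generate: Callable[[random.Random], str], n: int, seed: int = 0
) -> List[str]:
    """Draw n candidate answers from a (frozen) generator."""
    rng = random.Random(seed)
    return [generate(rng) for _ in range(n)]


# Hypothetical stand-in for sampling an answer from a model whose
# parameters are never updated.
def toy_generate(rng: random.Random) -> str:
    return rng.choice(["42", "42", "41", "24"])


# Hypothetical stand-in for a reward model or verifier.
def toy_score(answer: str) -> float:
    return 1.0 if answer == "42" else 0.0


if __name__ == "__main__":
    candidates = sample_candidates(toy_generate, n=8)
    print("majority vote:", majority_vote(candidates))
    print("best-of-N:    ", best_of_n(candidates, toy_score))
```

In both functions the only dial is `n`: more samples means more inference compute, while the generator itself stays untouched. That is the defining property of external TTS in this taxonomy.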
Internal and external TTS are complementary, not competing: internal TTS makes models better reasoners; external TTS extracts more performance from whatever reasoning capability exists. Combining them (e.g., using Best-of-N to boost a long-CoT model with a PRM) often outperforms either alone.
The practical distinction matters for deployment: internal scaling is a training cost paid once; external scaling is an inference cost paid per query. At high query volume the economics favor internal scaling, because the one-time training cost is amortized over many queries, but external scaling remains essential during development, when training is expensive.
The finding in "Can non-reasoning models catch up with more compute?" illustrates the limits of external TTS alone: the internal foundation must be in place before external scaling can amplify it.
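The once-versus-per-query trade-off can be made concrete with a break-even calculation. The sketch below is a simplification under stated assumptions: all three cost parameters are hypothetical inputs in whatever unit you track (GPU-hours, dollars), and it assumes the internally scaled model still has some per-query cost of its own, which the note does not quantify.

```python
import math


def break_even_queries(
    training_cost: float,
    external_cost_per_query: float,
    internal_cost_per_query: float,
) -> float:
    """Query count at which a one-time training cost is recovered.

    All arguments are hypothetical quantities in the same cost unit.
    Returns math.inf when the internally scaled model is not cheaper
    per query, since the training cost is then never recovered.
    """
    saving_per_query = external_cost_per_query - internal_cost_per_query
    if saving_per_query <= 0:
        return math.inf
    return training_cost / saving_per_query


if __name__ == "__main__":
    # Illustrative numbers only, not measurements.
    print(break_even_queries(1000.0, 0.5, 0.25))  # 4000.0
```

Below the break-even volume, paying per query for search and verification is the cheaper option, which fits the observation that external scaling is most useful while a system is still in development.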
Inquiring lines that use this note as a source 35
This note is a source for these synthesized inquiries.
- How do routing and test-time compute scaling work together as optimization axes?
- Does test-time compute actually substitute for having larger model parameters?
- What is the trade-off between parallel and sequential scaling at test time?
- How do sub-token and architecture-level compute optimization strategies compare?
- Can offline context optimization reduce test-time latency like sleep-time compute?
- How does the three-component definition apply to test-time scaling laws?
- Can test-time scaling prioritize genuine reasoning over pattern matching?
- How does reward function accuracy affect the efficiency of test-time compute allocation?
- How does test-time compute substitute for model parameter scaling?
- Can test-time compute on smaller models replace larger model inference?
- How do conditional scaling laws incorporate hardware into architecture choices?
- What mechanisms drive test-time compute allocation in reasoning tasks?
- Does test-time compute scaling work for agentic deep research tasks?
- How does test-time scaling relate to token budget in agentic deep research?
- Do models excel at reasoning depth or memory breadth when scaling test time compute?
- What are the computational trade-offs between training-time vs inference-time consistency correction?
- Can test-time compute allocation shift from solutions to strategies?
- What test-time strategies did o3 discover without human specification?
- How does task structure determine optimal test-time compute allocation?
- Where does sleep-time compute fit in the taxonomy of test-time scaling?
- How do internal versus external test-time scaling approaches differ from precomputation strategies?
- How much task-similar finetuning data does test-time training actually need?
- Does sparse parameter updating improve test-time training's computational cost?
- When is 15x token overhead actually worth the compute cost?
- Can memory and test-time compute scale together as a single axis?
- How does test-time verification decouple the act of checking from reasoning generation?
- Can test-time scaling work through retrieval rather than reasoning?
- How do external invocation latencies drive technique convergence?
- Can test-time compute fully replace scaling model parameters on hard problems?
- How does spending offline compute affect wake-time prediction latency?
- How should we measure and report serial compute separately?
- Can test-time compute scaling substitute for larger model parameters?
- Where does the generation-verification gap appear in test-time compute?
- How do coverage and identifiability set separate performance ceilings?
- Should agents use parallel or sequential scaling during test time?
Related concepts in this collection 7
- Can non-reasoning models catch up with more compute?
  Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
  the limit of external TTS without internal foundation
- How should we balance parallel versus sequential compute at test time?
  Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
  a cross-cutting axis that applies within each category
- Can retrieval be extended into multi-step chains like reasoning?
  Standard RAG retrieves once, but multi-hop tasks need intermediate steps. Can we train models to plan retrieval sequences the way chain-of-thought trains reasoning, and scale retrieval at test time?
  CoRAG is a hybrid that escapes the internal/external binary: training teaches chain generation (internal) while compute dials (chain length/count) are applied at inference (external); retrieval-intensive tasks have their own TTS curve that this taxonomy did not originally capture
- Can models precompute answers before users ask questions?
  Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
  sleep-time compute fractures the dichotomy by adding a third temporal position: pre-interaction compute is neither internal (weights trained) nor external (inference-time search) but amortized pre-computation; the binary taxonomy needs a third category
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  challenges the taxonomy: latent recurrent depth-scaling is internal (architectural recurrence) but applied at inference (external compute dial), occupying a hybrid position the binary did not anticipate; verbalization is orthogonal to the internal/external split
- Does RL post-training create reasoning or just deploy it?
  Investigates whether reasoning capability emerges during RL fine-tuning or already exists in base models. Matters because it reshapes how we build and optimize reasoning systems.
  reframes "internal TTS": if RL teaches *when* to activate latent capability rather than how to reason, then "internal TTS" is more accurately deployment-timing optimization than capability instillation; the foundation that external TTS amplifies was already in the base model
- Can modular cognitive tools unlock reasoning without training?
  Can reasoning capabilities be elicited by structuring LLM calls as isolated cognitive operations (understanding, recalling, examining, and backtracking) rather than through reinforcement learning?
  third-category instance: cognitive tools elicit reasoning at inference time without weight updates AND without external search infrastructure, so they are neither internal nor external in the original sense; the taxonomy needs to distinguish "trained to reason" from "scaffolded to reason"
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Test-Time Scaling with Reflective Generative Model
- Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
- A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
- Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models
- RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
- Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?
- Retrieval-augmented reasoning with lean language models
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Original note title
internal vs external tts is the primary taxonomic split in test-time scaling research