INQUIRING LINE

How does spending offline compute affect wake-time prediction latency?

This explores the trade-off between compute you spend ahead of time (training, augmentation, precomputation) and the latency you pay at the moment of prediction — and whether front-loading work actually buys you faster, better answers when the model 'wakes up' to respond.


The corpus frames offline and wake-time compute as a single budget that can be shifted between two pools, and the pools turn out not to be independent. "Can inference compute replace scaling up model size?" is the cleanest statement of the tension: on hard prompts, a smaller model given more inference compute can match a larger one, which means pretraining and inference compute trade against each other rather than each doing its own job. So in principle you can move work earlier (a bigger or better-trained model) to spend less at wake time, or move it later (more thinking tokens) to keep the model small.
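The two-pool framing can be made concrete with a back-of-the-envelope cost model. The sketch below is an illustration under stated assumptions, not something taken from the cited notes: it treats offline compute as a one-time cost, wake-time compute as a per-query cost, and asks at what query volume the offline-heavy option becomes cheaper overall. The names `Deployment` and `break_even_queries` and all numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical deployment; all costs share one arbitrary compute unit."""
    name: str
    offline_cost: float    # one-time training / precomputation cost
    per_query_cost: float  # wake-time cost paid on every prediction

    def total_cost(self, num_queries: int) -> float:
        return self.offline_cost + self.per_query_cost * num_queries


def break_even_queries(heavy_offline: Deployment, heavy_online: Deployment) -> float:
    """Query count above which the offline-heavy option is cheaper overall.

    Returns math.inf if the offline-heavy option never catches up.
    """
    extra_offline = heavy_offline.offline_cost - heavy_online.offline_cost
    saved_per_query = heavy_online.per_query_cost - heavy_offline.per_query_cost
    if saved_per_query <= 0:
        return math.inf
    return max(0.0, extra_offline / saved_per_query)


# Illustrative numbers only (assumptions, not measurements).
big_model = Deployment("large model, short answers", 1_000_000.0, 2.0)
small_model = Deployment("small model, long thinking", 200_000.0, 10.0)

n_star = break_even_queries(big_model, small_model)
print(f"offline-heavy option pays off after {n_star:,.0f} queries")
```

This captures cost only. It assumes the two options reach comparable answer quality, which holds in the source only for some prompts.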

But the corpus pushes back on the idea that the trade is symmetric. "Can non-reasoning models catch up with more compute?" finds that what you bake in offline can be qualitatively decisive: a model trained with a reasoning protocol stays ahead no matter how much inference budget a non-reasoning model is handed, because training is what makes those extra wake-time tokens productive in the first place. "Can training data augmentation match test-time compute scaling benefits?" shows the same move from the other direction: by spending offline compute to generate reasoning traces and fold them into pretraining, you get 3x data efficiency, and harder tokens automatically absorb longer traces. In effect, test-time scaling is relocated into the training phase so the model arrives already 'pre-thought.'

The sharpest lens on the prediction-latency side is the internal-versus-external split in "How do internal and external test-time scaling compare?": internal scaling (training models to reason autonomously) builds capability offline, while external scaling (search, verification, sampling at inference) extracts performance at wake time but adds latency. The two complement rather than compete. And "Does the choice of reasoning framework actually matter for test-time performance?" adds a humbling caveat: once you're spending at inference, the specific algorithm (best-of-N vs. MCTS) barely matters; total compute and the quality of your value function dominate. So the offline question becomes: did you train a good enough verifier or reward signal to make wake-time search worth its latency?
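The dependence of wake-time search on verifier quality can be seen in a toy best-of-N simulation. This is a minimal sketch under loud assumptions: each candidate answer is reduced to a single "true quality" number drawn uniformly from [0, 1], and the verifier sees that quality plus Gaussian noise. The function names and the noise model are hypothetical and do not come from the cited analysis.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], float],
    verifier: Callable[[float], float],
    n: int,
    rng: random.Random,
) -> Tuple[float, int]:
    """Draw n candidates, keep the one the verifier scores highest.

    Returns (true quality of the selected candidate, generator calls).
    The call count is a rough proxy for added wake-time latency.
    """
    candidates = [generate(rng) for _ in range(n)]
    best = max(candidates, key=verifier)  # verifier is called once per candidate
    return best, n


def mean_selected_quality(n: int, noise: float, trials: int = 2000, seed: int = 0) -> float:
    """Average true quality of the selected answer over many toy prompts."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Assumed verifier: true quality plus zero-mean Gaussian noise.
        noisy_verifier = lambda q: q + rng.gauss(0.0, noise)
        best, _ = best_of_n(lambda r: r.random(), noisy_verifier, n, rng)
        total += best
    return total / trials


if __name__ == "__main__":
    for noise in (0.0, 1.0):
        for n in (1, 8):
            quality = mean_selected_quality(n, noise)
            print(f"noise={noise:.1f} n={n}: mean selected quality {quality:.3f}")
```

In this toy setup, raising N costs N times the generator calls in every case, but the quality gain shrinks as the verifier gets noisier. That is the sense in which verifier quality, trained offline, decides whether the added latency is worth paying.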

There's also a hardware-level version of this trade that's easy to miss. "Does recomputing weights cost less than moving them on mobile?" shows that on memory-bound mobile devices, running one transformer block twice with shared weights is actually *faster* than fetching a second set of weights, because the bottleneck at wake time is moving data, not doing math. That inverts the usual intuition: spending more compute at prediction time can lower latency when memory movement is the real cost. The limit of precomputation shows up vividly in "How can real-time recommendations stay responsive and reproducible?", where Netflix can't precompute anything that depends on signals arriving mid-session. That forces runtime recomputation, which raises call volume and timeout risk: a concrete case where you simply *can't* move the work offline.
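The memory-bound claim about mobile hardware can be sanity-checked with a roofline-style estimate, in which latency is bounded by the slower of compute time and memory-transfer time. The sketch below is an assumption-laden illustration, not MobileLLM's measurement method: it assumes compute and memory traffic overlap perfectly, that shared weights stay in fast memory between the two passes, and it uses made-up hardware numbers. `pair_latency_s` is a hypothetical helper.

```python
def pair_latency_s(
    flops_per_block: float,
    weight_bytes_per_block: float,
    peak_flops: float,
    mem_bandwidth: float,
    share_weights: bool,
) -> float:
    """Roofline-style latency estimate for two adjacent transformer blocks.

    Both variants perform two blocks' worth of math. With shared weights
    the weights are fetched from slow memory once; otherwise twice.
    """
    compute_s = 2 * flops_per_block / peak_flops
    fetches = 1 if share_weights else 2
    memory_s = fetches * weight_bytes_per_block / mem_bandwidth
    return max(compute_s, memory_s)


# Illustrative numbers only (assumptions, not device specs).
FLOPS_PER_BLOCK = 2e7   # one decoded token through one block
WEIGHT_BYTES = 20e6     # 20 MB of weights per block
PEAK_FLOPS = 1e12       # 1 TFLOP/s
MEM_BANDWIDTH = 20e9    # 20 GB/s

separate = pair_latency_s(FLOPS_PER_BLOCK, WEIGHT_BYTES, PEAK_FLOPS, MEM_BANDWIDTH, False)
shared = pair_latency_s(FLOPS_PER_BLOCK, WEIGHT_BYTES, PEAK_FLOPS, MEM_BANDWIDTH, True)
print(f"separate weights: {separate * 1e3:.2f} ms, shared weights: {shared * 1e3:.2f} ms")
```

With these numbers the math takes far less time than the weight transfer, so halving the transfers halves the estimate. On a compute-bound device (try a much lower `peak_flops`), the two variants come out equal, which is why the result is specific to memory-bound hardware.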

The thing you didn't know you wanted to know: offline compute doesn't just 'speed up' wake-time prediction; it changes whether wake-time spending pays off at all. Training installs the protocol that makes inference tokens productive ("Can non-reasoning models catch up with more compute?"), and the strongest results come when both pools are tuned together rather than traded one-for-one.


Sources (7 notes)

Can inference compute replace scaling up model size?

Snell et al. (2024) showed that inference-time compute trades off against model parameter scaling, especially on difficult prompts. This reveals pretraining and inference compute are not independent resources.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can training data augmentation match test-time compute scaling benefits?

Augmenting pretraining data with LLM-generated reasoning traces improves data efficiency 3x and reasoning benchmark performance 10%+ for 3B models. Harder tokens automatically receive longer traces, creating a natural compute-allocation mechanism analogous to test-time scaling.

How do internal and external test-time scaling compare?

Research shows test-time scaling methods split into internal (training models for autonomous reasoning) and external (inference-time search and verification). They complement rather than compete; internal builds capability while external extracts performance from existing capability.

Does the choice of reasoning framework actually matter for test-time performance?

Information-theoretic analysis shows BoN and MCTS converge in reasoning accuracy when controlling for total compute. Snowball errors accumulate per step regardless of framework; mitigation depends on search scope and reward function reliability, not the specific algorithm.

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks (running one block twice) incurs less latency than fetching separate weights, and it gains accuracy with no increase in parameter count.

How can real-time recommendations stay responsive and reproducible?

Netflix's in-session adaptation improves ranking by 6% relative, but precomputing is impossible when signals arrive mid-session. This forces runtime recomputation, increasing call volume, timeout risk, and making bugs harder to reproduce.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst. The question remains open: **How does spending offline compute affect wake-time prediction latency?** This is not settled; constraints may have shifted.

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026. A library curated around this question identified:
- Test-time and pretraining compute trade off: smaller models + more inference can match larger ones on hard prompts, BUT only if trained with reasoning protocols; models without that training do not catch up at any inference budget (2025).
- Training reasoning traces into pretraining yields 3× data efficiency; hard tokens automatically absorb longer traces, relocating test-time scaling into the training phase (2025).
- Internal scaling (training autonomous reasoning offline) vs. external scaling (search/verification at wake time) complement rather than compete; external scaling adds latency, and its accuracy depends on total compute and verifier quality, not algorithm choice (2025).
- On memory-bound devices, recomputing blocks is faster than fetching weights — inverting the usual trade, making offline precomputation less valuable than wake-time recomputation when memory movement is the bottleneck (2024).
- Real-time signals arriving mid-session force runtime recomputation that cannot be precomputed, raising call volume and timeout risk (2022).

Anchor papers (verify; mind their dates):
- arXiv:2305.20050 (2023): Step-by-step verification
- arXiv:2402.14905 (2024): Mobile LLM weight optimization
- arXiv:2501.15602 (2025): External slow-thinking efficacy
- arXiv:2509.20186 (2025): Thinking augmented pre-training

Your task:
(1) **RE-TEST EACH CONSTRAINT.** For each claim above, ask: have newer models (o1, o3, Claude 3.5+), training methods (RL scaling, synthetic reasoning data), inference harnesses (token caching, speculative execution, batching), or device hardware (quantization, KV-cache compression) since relaxed it? Flag what still holds and what may have shifted. Separate the durable question (offline–wake-time trade structure) from perishable limits (specific model scales, device specs).
(2) **Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months.** Does any recent paper argue the offline/wake-time split is a false dichotomy, or show one pool dominates?
(3) **Propose 2 research questions that ASSUME the regime may have moved:** e.g., *If reasoning is now trainable at scale, does precomputation still solve latency on real workloads?* or *Does hardware-level memory movement still dominate latency on current mobile/edge silicon?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
