SYNTHESIS NOTE

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

Synthesis note · 2026-02-20 · sourced from Test Time Compute

The standard model of test-time compute treats each query as stateless — context and query arrive together, model thinks, response is generated. But most real LLM applications are stateful: a coding agent operates on a persistent repository, a document QA system uses the same documents across many questions, a conversational assistant maintains an ongoing history.

Sleep-time compute exploits this statefulness. Between interactions — when the model would otherwise be idle — it can pre-compute inferences about the context: anticipated questions, architectural patterns in code, likely debugging paths. At query time, these pre-computed inferences are provided alongside the prompt, allowing the model to respond with far less latency while maintaining the accuracy of heavier compute.
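The two phases described above can be sketched as a small cache that is filled while idle and read at query time. This is a minimal illustration, not the method of any particular paper: `SleepTimeCache`, `Model`, the method names, and the prompt layout are all hypothetical, and `model` stands in for any function that maps a prompt string to a completion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type: any callable that maps a prompt string to a completion.
Model = Callable[[str], str]


@dataclass
class SleepTimeCache:
    """Holds inferences precomputed about a persistent context (illustrative only)."""

    model: Model
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def sleep(self, context_id: str, context: str, prompts: List[str]) -> None:
        """Idle-time phase: spend compute on the context before any query arrives."""
        self.notes[context_id] = [
            self.model(f"{prompt}\n\n{context}") for prompt in prompts
        ]

    def answer(self, context_id: str, context: str, query: str) -> str:
        """Query-time phase: supply the precomputed notes alongside the prompt."""
        precomputed = "\n".join(self.notes.get(context_id, []))
        prompt = (
            f"Notes:\n{precomputed}\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}"
        )
        return self.model(prompt)
```

In this sketch the idle-time prompts might be things like "List questions a user is likely to ask" or "Summarize the architecture", so that the query-time call can be a single short pass over notes that already exist.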

The economic logic is amortization. If multiple queries share the same context, the cost of any sleep-time compute applied to that context is spread across all of them, so the per-query cost falls while accuracy is preserved.
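The amortization argument can be made concrete with a little arithmetic. The sketch below assumes a simple additive cost model (a one-time sleep-time cost plus a light per-query cost, compared against a heavy per-query cost with no precomputation). The function names and the numbers in the usage note are illustrative assumptions, not figures from the source.

```python
def per_query_cost(sleep_cost: float, query_cost: float, num_queries: int) -> float:
    """Average cost per query when one sleep-time pass is shared by num_queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_cost / num_queries + query_cost


def break_even_queries(
    sleep_cost: float, light_query_cost: float, heavy_query_cost: float
) -> float:
    """Number of queries at which precomputing costs the same as thinking per query.

    Assumes the light (precomputed) and heavy (no precomputation) paths
    reach comparable accuracy, which is the premise of the note above.
    """
    saving = heavy_query_cost - light_query_cost
    if saving <= 0:
        raise ValueError("precomputation must make each query cheaper")
    return sleep_cost / saving
```

For example, with a hypothetical sleep-time pass of 1000 units and a light query of 100 units, ten queries average `per_query_cost(1000, 100, 10)`, which is 200.0 units each. If answering without precomputation costs 600 units per query, `break_even_queries(1000, 100, 600)` gives 2.0, so precomputation pays for itself once more than two queries share the context.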

This reframes the design question. Instead of "how much compute should the model use when answering?", the question becomes "when should compute happen?", and the answer is often before the user asks, not during. See the writing angle "When should AI systems do their thinking?"

Think-in-Memory as conversational sleep-time compute (2311.08719): TiM applies the sleep-time principle to conversational memory. After generating a response, the agent post-thinks: it integrates historical and new thoughts to update an evolved memory using insert, forget, and merge operations. Future queries retrieve pre-reasoned thoughts rather than re-deriving them from raw history. This eliminates inconsistent reasoning paths (different conclusions drawn from the same evidence when it is recalled for different questions) by ensuring that reasoning about history happens once and persists. The memory evolves through explicit operations rather than by accumulating raw context, making it a concrete implementation of sleep-time compute for the multi-turn conversation use case. See "Can storing evolved thoughts prevent inconsistent reasoning in conversations?"
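The three memory operations named above can be sketched as a small data structure. This is a simplified illustration and not the TiM implementation: keying thoughts by a plain topic string, the `ThoughtMemory` class, and the `recall` method are assumptions made for the example, and the merged thought is passed in by the caller (in practice it would presumably come from a model during post-thinking).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThoughtMemory:
    """Stores pre-reasoned thoughts grouped by topic (hypothetical structure)."""

    thoughts: Dict[str, List[str]] = field(default_factory=dict)

    def insert(self, topic: str, thought: str) -> None:
        """Add a newly derived thought under a topic."""
        self.thoughts.setdefault(topic, []).append(thought)

    def forget(self, topic: str, thought: str) -> None:
        """Drop a thought that is no longer useful; drop the topic if it empties."""
        bucket = self.thoughts.get(topic, [])
        if thought in bucket:
            bucket.remove(thought)
        if not bucket:
            self.thoughts.pop(topic, None)

    def merge(self, topic: str, merged_thought: str) -> None:
        """Replace every thought under a topic with one consolidated thought."""
        if topic in self.thoughts:
            self.thoughts[topic] = [merged_thought]

    def recall(self, topic: str) -> List[str]:
        """Return a copy of the stored thoughts, so callers cannot mutate memory."""
        return list(self.thoughts.get(topic, []))
```

Because later queries read from `recall` instead of re-reasoning over raw history, two different questions about the same topic start from the same stored conclusion, which is the consistency property the note describes.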

Inquiring lines that read this note 1

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

Can inference-time compute substitute for scaling up model parameters?

How does precomputing context reasoning reduce latency in stateful applications?

Related concepts in this collection 8

Can we allocate inference compute based on prompt difficulty? Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
a complementary rethinking of how to allocate compute
Can non-reasoning models catch up with more compute? Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
both show deployment context (statefulness / training regime) matters as much as raw compute
How do internal and external test-time scaling compare? Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
sleep-time compute is a third category: pre-interaction test-time scaling that fits neither internal nor external
Can neural memory modules scale language models beyond attention limits? Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans' persistent memory architecture is a natural implementation substrate for sleep-time compute: the adaptive memory can store precomputed inferences that persist across interactions, and its surprise-based update mechanism naturally prioritizes novel precomputed insights
Can decoding-time tuning preserve knowledge better than weight fine-tuning? Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
complementary inference-time adaptation: proxy-tuning applies domain adaptation at decoding time without weight modification, sleep-time compute applies reasoning pre-computation between interactions; both demonstrate that significant model behavior changes can be achieved without retraining
Can models treat long prompts as external code environments? Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
complementary reframing: sleep-time compute separates WHEN to process context (before vs. during query), while RLMs separate WHERE context lives (external environment vs. context window); both reject the default of stuffing everything into the window at query time
Can long-context models resolve retriever-reader imbalance? Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
parallel rebalancing of the retrieval pipeline: LongRAG shifts work from retriever to reader within a single query; sleep-time compute shifts work from query time to pre-query time; both challenge the assumption that query-time retrieval is where intelligence must concentrate
Can recurrence consolidate memory without predicting tokens? Recurrent neural networks typically use recurrence only for prediction. But could offline recurrent passes serve a second purpose—consolidating transient context into persistent weights, like sleep does in brains?
extends: consolidation is a second offline use of sleep-phase compute beyond precomputing answers
