Can models precompute answers before users ask questions?
Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
The standard model of test-time compute treats each query as stateless: context and query arrive together, the model thinks, and a response is generated. But most real LLM applications are stateful: a coding agent operates on a persistent repository, a document QA system uses the same documents across many questions, and a conversational assistant maintains an ongoing history.
Sleep-time compute exploits this statefulness. Between interactions, when the model would otherwise be idle, it can precompute inferences about the context: anticipated questions, architectural patterns in code, likely debugging paths. At query time, these precomputed inferences are provided alongside the prompt, so the model can respond with far less latency while maintaining the accuracy it would otherwise reach only with heavier query-time compute.
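The two phases described above can be sketched in a few lines. This is a minimal illustration, not the method of any particular paper: `SleepTimeCache`, `ModelFn`, the `sleep` and `answer` method names, and the prompt layout are all hypothetical, and `model` stands in for whatever function calls the underlying LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns text.
ModelFn = Callable[[str], str]


@dataclass
class SleepTimeCache:
    """Stores inferences precomputed about a persistent context."""

    model: ModelFn
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def sleep(self, context_id: str, context: str, prompts: List[str]) -> None:
        """Idle-time pass: run each analysis prompt over the context and keep the results."""
        self.notes[context_id] = [self.model(f"{p}\n\n{context}") for p in prompts]

    def answer(self, context_id: str, context: str, query: str) -> str:
        """Query-time pass: prepend any precomputed notes to the prompt, then call the model once."""
        notes = "\n".join(self.notes.get(context_id, []))
        prompt = f"Notes:\n{notes}\n\nContext:\n{context}\n\nQuestion: {query}"
        return self.model(prompt)
```

In this sketch the expensive analysis runs once per context in `sleep`, and every later call to `answer` for the same `context_id` reuses the stored notes. A real system would also need to invalidate the notes when the context changes, which is omitted here.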
The economic logic is amortization. If multiple queries share the same context, any sleep-time compute applied to that context is spread across all of those queries. The per-query cost falls as the number of queries grows, while accuracy is preserved.
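The amortization argument reduces to simple arithmetic. The sketch below assumes a single sleep-time pass with a fixed cost and a constant cost per query; the function names and the cost units are illustrative and do not come from the source.

```python
def per_query_cost(sleep_cost: float, query_cost: float, num_queries: int) -> float:
    """Average cost per query when one sleep-time pass is shared by num_queries queries."""
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_cost / num_queries + query_cost


def break_even_queries(sleep_cost: float, light_query_cost: float, heavy_query_cost: float) -> float:
    """Number of queries at which precomputing costs the same as heavy query-time compute."""
    saving = heavy_query_cost - light_query_cost
    if saving <= 0:
        raise ValueError("precomputation must make each query cheaper")
    return sleep_cost / saving
```

With made-up numbers, a sleep-time pass costing 100 units and a light query costing 10 units averages 110 units for a single query but 20 units per query across ten queries. If the heavy query-time alternative costs 30 units per query, precomputation breaks even at five queries on the same context.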
This reframes the design question: instead of "how much compute should the model use when answering?", the question becomes "when should compute happen?", and the answer is often before the user asks, not during. See the writing angle When should AI systems do their thinking?
Think-in-Memory as conversational sleep-time compute (2311.08719): TiM applies the sleep-time principle to conversational memory. After generating a response, the agent post-thinks: it integrates historical and new thoughts and updates an evolving memory through insert, forget, and merge operations. Future queries retrieve pre-reasoned thoughts rather than re-deriving them from raw history. Because reasoning about history happens once and persists, this eliminates inconsistent reasoning paths, where the same evidence recalled for different questions leads to different conclusions. The memory evolves through explicit operations rather than by accumulating raw context, which makes TiM a concrete implementation of sleep-time compute for the multi-turn conversation use case. See Can storing evolved thoughts prevent inconsistent reasoning in conversations?
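The insert, forget, and merge operations can be illustrated with a toy memory. This is a deliberately simplified sketch and not the TiM implementation: the `ThoughtMemory` class, the grouping of thoughts under a plain string `topic` key, and the `recall` method are assumptions made for the example, and the step where a model decides which operation to apply is left out.

```python
from collections import defaultdict
from typing import Dict, List


class ThoughtMemory:
    """Toy memory of pre-reasoned thoughts, grouped by a hypothetical topic key."""

    def __init__(self) -> None:
        self._thoughts: Dict[str, List[str]] = defaultdict(list)

    def insert(self, topic: str, thought: str) -> None:
        """Add a new thought unless an identical one is already stored."""
        if thought not in self._thoughts[topic]:
            self._thoughts[topic].append(thought)

    def forget(self, topic: str, thought: str) -> None:
        """Drop a thought that is outdated or contradicted."""
        if thought in self._thoughts.get(topic, []):
            self._thoughts[topic].remove(thought)

    def merge(self, topic: str, merged: str) -> None:
        """Replace all thoughts on a topic with a single consolidated thought."""
        self._thoughts[topic] = [merged]

    def recall(self, topic: str) -> List[str]:
        """Return stored thoughts for a topic without re-reading raw history."""
        return list(self._thoughts.get(topic, []))
```

The point of the sketch is that `recall` returns conclusions already reached after earlier turns, so two different questions about the same topic start from the same stored thoughts instead of re-deriving them separately from the raw conversation.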
Related concepts in this collection 8
- Can we allocate inference compute based on prompt difficulty?
Does adjusting how much compute each prompt receives—rather than using a fixed budget—improve model performance? Could smarter allocation let smaller models compete with larger ones?
a complementary rethinking of how to allocate compute
- Can non-reasoning models catch up with more compute?
Explores whether inference-time compute budget can close the performance gap between standard models and those trained for reasoning, and what training mechanisms might enable this.
both show deployment context (statefulness / training regime) matters as much as raw compute
- How do internal and external test-time scaling compare?
Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
sleep-time compute is a third category: pre-interaction test-time scaling that fits neither the internal nor the external category
- Can neural memory modules scale language models beyond attention limits?
Can separating short-term attention from adaptive long-term memory allow models to efficiently handle context windows exceeding 2M tokens while maintaining competitive performance?
Titans' persistent memory architecture is a natural implementation substrate for sleep-time compute: the adaptive memory can store precomputed inferences that persist across interactions, and its surprise-based update mechanism naturally prioritizes novel precomputed insights
- Can decoding-time tuning preserve knowledge better than weight fine-tuning?
Explores whether applying alignment signals at inference time rather than modifying model weights can better preserve the factual knowledge learned during pretraining while still achieving alignment goals.
complementary inference-time adaptation: proxy-tuning applies domain adaptation at decoding time without weight modification, while sleep-time compute applies reasoning precomputation between interactions; both demonstrate that significant changes in model behavior can be achieved without retraining
- Can models treat long prompts as external code environments?
Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
complementary reframing: sleep-time compute separates WHEN context is processed (before vs. during the query), while Recursive Language Models (RLMs) separate WHERE context lives (external environment vs. context window); both reject the default of stuffing everything into the window at query time
- Can long-context models resolve retriever-reader imbalance?
Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
parallel rebalancing of the retrieval pipeline: LongRAG shifts work from retriever to reader within a single query; sleep-time compute shifts work from query time to pre-query time; both challenge the assumption that query-time retrieval is where intelligence must concentrate
- Can recurrence consolidate memory without predicting tokens?
Recurrent neural networks typically use recurrence only for prediction. But could offline recurrent passes serve a second purpose—consolidating transient context into persistent weights, like sleep does in brains?
extends: consolidation is a second offline use of sleep-phase compute beyond precomputing answers
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Sleep-time Compute: Beyond Inference Scaling at Test-time
- What Characterizes Effective Reasoning? Revisiting Length, Review, and Structure of CoT
- Language Models Need Sleep
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
- Reasoning Models Can Be Effective Without Thinking
- Recursive Language Models
- Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
- Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Original note title
sleep-time compute reduces test-time latency by precomputing over stateful context