SYNTHESIS NOTE

Can models precompute answers before users ask questions?

Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?

Synthesis note · 2026-02-20 · sourced from Test Time Compute
How should we allocate compute budget at inference time?

The standard model of test-time compute treats each query as stateless — context and query arrive together, model thinks, response is generated. But most real LLM applications are stateful: a coding agent operates on a persistent repository, a document QA system uses the same documents across many questions, a conversational assistant maintains an ongoing history.

Sleep-time compute exploits this statefulness. Between interactions, when the model would otherwise be idle, it can precompute inferences about the context: anticipated questions, architectural patterns in code, likely debugging paths. At query time, these precomputed inferences are supplied alongside the prompt, so the model can respond with far less latency while retaining the accuracy that would otherwise require heavier test-time compute.
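The flow described above can be sketched in a few lines. This is a minimal illustration, not an implementation from the source: `SleepTimeCache`, `sleep_time_pass`, `build_prompt`, and `toy_think` are hypothetical names, and `toy_think` stands in for whatever expensive model call a real system would make while idle.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SleepTimeCache:
    """Hypothetical store for inferences precomputed while the system is idle."""
    notes: Dict[str, List[str]] = field(default_factory=dict)

    def sleep_time_pass(self, context_id: str, context: str,
                        think: Callable[[str], List[str]]) -> None:
        # Run the expensive reasoning step once, between interactions,
        # and keep its output keyed by the context it describes.
        self.notes[context_id] = think(context)

    def build_prompt(self, context_id: str, context: str, query: str) -> str:
        # At query time, attach whatever was precomputed for this context.
        # If nothing was precomputed, the prompt is just context plus question.
        parts = [f"Context:\n{context}"]
        precomputed = self.notes.get(context_id, [])
        if precomputed:
            bullet_list = "\n".join(f"- {note}" for note in precomputed)
            parts.append(f"Precomputed notes:\n{bullet_list}")
        parts.append(f"Question: {query}")
        return "\n\n".join(parts)


def toy_think(context: str) -> List[str]:
    # Placeholder for an LLM call (assumption: a real system prompts a model here).
    return [f"The context has {len(context.split())} words."]
```

In use, `sleep_time_pass` would run whenever the context changes and no query is pending, and `build_prompt` would run on every query. The query-time path does no reasoning of its own beyond what the model does with the enriched prompt.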

The economic logic is amortization. If multiple queries share the same context, sleep-time compute applied to that context is paid once and spread across all of those queries. The average cost per query drops as the number of queries grows, while accuracy is preserved.
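The amortization argument can be made concrete with a small calculation. The cost model here is an assumption for illustration (one fixed sleep-time cost per context plus a reduced test-time cost per query), and the numbers in the comments are made up, not results from the source.

```python
def amortized_cost_per_query(sleep_cost: float, test_cost: float,
                             num_queries: int) -> float:
    """Average cost per query when one sleep-time pass serves num_queries queries.

    sleep_cost: one-time compute spent on the shared context (hypothetical units)
    test_cost:  compute still spent on each query at answer time
    """
    if num_queries < 1:
        raise ValueError("num_queries must be at least 1")
    return sleep_cost / num_queries + test_cost


# Illustrative numbers only: a 20-unit sleep-time pass that cuts
# per-query test-time compute to 2 units.
one_query = amortized_cost_per_query(20.0, 2.0, 1)     # 22.0
ten_queries = amortized_cost_per_query(20.0, 2.0, 10)  # 4.0
```

Under these assumed numbers, if answering without any precomputation cost 10 units per query, the sleep-time approach would be more expensive for a single query and cheaper once three or more queries share the context.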

This reframes the design question: instead of "how much compute should the model use when answering?", the question becomes "when should compute happen?", and the answer is often before the user asks, not during. See the writing angle "When should AI systems do their thinking?"

Think-in-Memory as conversational sleep-time compute (2311.08719): TiM applies the sleep-time principle to conversational memory. After generating a response, the agent post-thinks, integrating historical and new thoughts to update an evolved memory using insert, forget, and merge operations. Future queries retrieve pre-reasoned thoughts rather than re-deriving them from raw history. This eliminates inconsistent reasoning paths, where the same evidence recalled for different questions leads to different conclusions, because reasoning about history happens once and persists. The memory evolves through explicit operations rather than by accumulating raw context, which makes it a concrete implementation of sleep-time compute for multi-turn conversation. See "Can storing evolved thoughts prevent inconsistent reasoning in conversations?"
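The three memory operations can be sketched as a small data structure. This is a toy illustration of insert, forget, and merge under assumed semantics, not the TiM paper's implementation: `ThoughtMemory`, its topic-keyed layout, and `recall` are hypothetical, and in a real system a model would decide which operation to apply and write the merged thought.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ThoughtMemory:
    # topic -> stored thoughts (hypothetical layout chosen for this sketch)
    thoughts: Dict[str, List[str]] = field(default_factory=dict)

    def insert(self, topic: str, thought: str) -> None:
        # Store a new thought produced during post-thinking, skipping duplicates.
        bucket = self.thoughts.setdefault(topic, [])
        if thought not in bucket:
            bucket.append(thought)

    def forget(self, topic: str, thought: str) -> None:
        # Drop a thought that is no longer needed; remove the topic if it empties.
        bucket = self.thoughts.get(topic, [])
        if thought in bucket:
            bucket.remove(thought)
        if not bucket:
            self.thoughts.pop(topic, None)

    def merge(self, topic: str, merged: str) -> None:
        # Replace all thoughts on a topic with one consolidated thought.
        if topic in self.thoughts:
            self.thoughts[topic] = [merged]

    def recall(self, topic: str) -> List[str]:
        # Future queries read pre-reasoned thoughts instead of raw history.
        return list(self.thoughts.get(topic, []))


mem = ThoughtMemory()
mem.insert("trip", "User plans to visit Kyoto.")
mem.insert("trip", "User prefers trains.")
mem.insert("diet", "User is vegetarian.")
mem.forget("diet", "User is vegetarian.")
mem.merge("trip", "User plans a train trip to Kyoto.")
```

After these calls, every later question about the trip retrieves the same merged thought, which is the consistency property the paragraph above describes.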

Original note title

sleep-time compute reduces test-time latency by precomputing over stateful context