SYNTHESIS NOTE

Can models treat long prompts as external code environments?

Do language models handle vastly longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?

Synthesis note · 2026-02-23 · sourced from Inference time scaling

Context rot, the degradation in output quality as context lengthens, affects even frontier models like GPT-5. Extending context windows is an arms race: each increase buys more capacity without solving the underlying problem that attention-based processing degrades with length. Recursive Language Models (RLMs) sidestep the problem by changing where the context lives.

The key insight: long prompts should not be fed into the transformer directly. Instead, they should be treated as part of an external environment that the model can symbolically interact with. In the RLM implementation, the prompt is stored as a variable in a Python REPL. The model reads, filters, chunks, and queries its context through code execution rather than token-space attention.
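A minimal sketch of the idea, assuming a plain in-process namespace stands in for the REPL. The names `ContextREPL`, `run`, and `max_output_chars` are hypothetical and not taken from the RLM implementation; the point is only that the model sees the printed output of its own code, never the full prompt.

```python
import contextlib
import io


class ContextREPL:
    """Hypothetical stand-in for an RLM-style environment (illustrative names only)."""

    def __init__(self, prompt: str, max_output_chars: int = 2000):
        # The long prompt lives here as an ordinary variable, outside the model's window.
        self.namespace = {"context": prompt}
        self.max_output_chars = max_output_chars

    def run(self, code: str) -> str:
        """Execute model-written code and return only its truncated printed output."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)  # assumption: a real system would sandbox this
        return buffer.getvalue()[: self.max_output_chars]


# A prompt far larger than anything we would want to paste into a model's window.
long_prompt = "\n".join(f"line {i}: filler" for i in range(100_000))
repl = ContextREPL(long_prompt)

# The "model" inspects the context through code instead of reading it.
size_report = repl.run("print(len(context.splitlines()))")
peek = repl.run("print(context.splitlines()[41])")
```

Truncating the returned output is a design choice in this sketch: it keeps an accidental `print(context)` from pushing the whole prompt back into the model's window.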

Two mechanisms make this work:

Model priors let the model filter the context without reading it in full. The model uses its existing knowledge to construct targeted queries: running regex searches for keywords, printing specific line ranges to inspect, and narrowing the search space based on its understanding of the task. It doesn't need to attend to 100K tokens to find the relevant 500. This is analogous to how humans skim a long document: prior knowledge guides where to look.

Recursive sub-calls defer unbounded reasoning chains. When the context requires reasoning over multiple chunks, the model spawns sub-RLM calls, each operating on a manageable portion. The decomposition is dynamic: the model decides how to partition based on what it observes rather than following a predefined chunking strategy.
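The two mechanisms can be sketched roughly as follows. Everything here is an assumption made for illustration: `grep_lines`, `recursive_answer`, and `toy_llm` are hypothetical names, the language model is replaced by a stub, and the context is split by fixed halving, whereas the note describes partitioning chosen dynamically by the model.

```python
import re
from typing import Callable

# Hypothetical signature: (question, text) -> answer. A real system would call a model.
LLM = Callable[[str, str], str]


def grep_lines(context: str, pattern: str, window: int = 1) -> str:
    """Prior-guided filtering: keep matching lines plus `window` lines either side."""
    lines = context.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if re.search(pattern, line, flags=re.IGNORECASE):
            keep.update(range(max(0, i - window), min(len(lines), i + window + 1)))
    return "\n".join(lines[i] for i in sorted(keep))


def recursive_answer(question: str, context: str, llm: LLM,
                     max_chars: int = 4000, depth: int = 0, max_depth: int = 12) -> str:
    """Answer over a long context by recursing on pieces small enough for one call."""
    if len(context) <= max_chars or depth >= max_depth:
        # Base case; at the depth limit the text is truncated (a simplification).
        return llm(question, context[:max_chars])
    mid = len(context) // 2
    cut = context.rfind("\n", 0, mid)  # prefer splitting on a line boundary
    if cut <= 0:
        cut = mid
    partials = [
        recursive_answer(question, part, llm, max_chars, depth + 1, max_depth)
        for part in (context[:cut], context[cut:])
    ]
    # One more call combines the partial answers.
    return llm(question, "\n".join(partials))


def toy_llm(question: str, text: str) -> str:
    # Stub for illustration only: echoes lines containing the question's last word.
    keyword = question.split()[-1]
    return "\n".join(line for line in text.splitlines() if keyword in line)


rows = [f"row {i}: nothing here" for i in range(5000)]
rows[3133] = "row 3133: the launch code is 9001"
document = "\n".join(rows)

filtered = grep_lines(document, r"launch code", window=1)
answer = recursive_answer("what is the launch code", document, toy_llm)
```

With the stub, `filtered` holds three lines around the match and `answer` is the single matching row. A real model would replace `toy_llm` and would choose the regex pattern and the partitioning itself.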

Results: RLMs handle inputs up to two orders of magnitude beyond model context windows. On shorter prompts (within context limits), RLMs still dramatically outperform base models and common long-context scaffolds, including context compaction. Cost per query is comparable or lower because the model processes only the relevant portions of the context rather than attending to everything.

This connects to "Can models precompute answers before users ask questions?" as a second reframing of compute allocation: sleep-time compute asks WHEN to compute (before vs. during the query); RLMs ask WHERE to keep the data (the model's context vs. an external environment). Both reject the default of "stuff everything into the context window at query time."

Inquiring lines that read this note 18

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do prompt structure and constraints affect model instruction reliability?

Why do language models struggle with implicit discourse relations?

What happens to anaphoric reference when context exceeds the window?

Can prompting inject entirely new knowledge into language models?

When should retrieval-augmented systems decide to fetch new information?

Can context windows and RAG actually change what language models generate?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does trajectory filtering handle noise when language models use code execution tools?

How should dialogue recommender systems manage conversation history and state?

Why do longer context windows alone fail to capture temporal dynamics in dialogue?

Why do correct reasoning traces tend to be shorter than incorrect ones?

What makes extended chains more vulnerable than standard prompts?

What memory architectures best support persistent reasoning across extended interactions?

How do training priors constrain what context information can override?

Why does teacher forcing fail to capture long-range dependencies?

How should agents balance memory condensation to optimize context efficiency?

How do external prompt artifacts improve agent behavior compared to inline instructions?

What critical LLM failures do standard benchmarks hide?

Why do LLMs degrade on long inputs before hitting context limits?

Related concepts in this collection 5

Can models precompute answers before users ask questions? Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
parallel reframing: sleep-time is temporal (when to compute), RLMs are spatial (where to keep data); both reject the default context-stuffing approach

How do internal and external test-time scaling compare? Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
RLMs are a novel form of external TTS: compute spent on environmental interaction rather than model-internal reasoning

Does reasoning ability actually degrade with longer inputs? Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
RLMs address this directly by offloading context to environment; the model only attends to relevant fragments

Can long-context models resolve retriever-reader imbalance? Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
RLMs take the opposite approach: shift burden to retrieval (code-based context probing) rather than reading (attention over everything)

When should AI systems do their thinking? Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
parallel temporal reframing: sleep-time asks WHEN to compute, RLMs ask WHERE to keep data; both reject the assumption that all processing must happen inside the context window at query time
