Can models treat long prompts as external code environments?
Can language models handle far longer inputs by offloading context to a Python REPL and querying it programmatically, rather than fitting everything into the transformer's attention window?
Context rot, the degradation in output quality as context lengthens, affects even frontier models like GPT-5. Extending context windows is an arms race: each increase buys more capacity but doesn't solve the fundamental problem that attention-based processing degrades with length. Recursive Language Models (RLMs) sidestep this by changing where the context lives.
The key insight: long prompts should not be fed into the transformer directly. Instead, they should be treated as part of an external environment that the model can symbolically interact with. In the RLM implementation, the prompt is stored as a variable in a Python REPL. The model reads, filters, chunks, and queries its context through code execution rather than token-space attention.
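A minimal sketch of the idea, not the RLM implementation itself: the class name `ContextREPL`, its `run` method, and the synthetic log data are all hypothetical, chosen only to show a prompt living in a variable while short code snippets probe it. It assumes trusted code; a real system would need sandboxing.

```python
import contextlib
import io


class ContextREPL:
    """Holds a long prompt as a variable and runs code snippets against it."""

    def __init__(self, context: str, max_output_chars: int = 500):
        # The full prompt lives here, outside the model's attention window.
        self.namespace = {"context": context}
        self.max_output_chars = max_output_chars

    def run(self, code: str) -> str:
        """Execute a snippet and return its printed output, truncated."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)  # assumption: trusted code, no sandbox
        return buffer.getvalue()[: self.max_output_chars]


# A synthetic 10,000-line "prompt" with one relevant line buried inside.
lines = [f"entry {i}: routine log message" for i in range(10_000)]
lines[7_321] = "entry 7321: ERROR disk quota exceeded on node-4"
repl = ContextREPL("\n".join(lines))

# Snippets of the kind a model might emit: check size, search, then inspect.
size = repl.run("print(len(context.splitlines()))")
hits = repl.run(
    "import re\n"
    "for m in re.finditer(r'^.*ERROR.*$', context, flags=re.MULTILINE):\n"
    "    print(m.group(0))"
)
window = repl.run("print('\\n'.join(context.splitlines()[7320:7323]))")
```

Only the short strings returned by `run` would ever be shown to the model; the 10,000 lines themselves stay in the environment.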
Two mechanisms make this work:
Model priors let the model filter context without reading all of it. The model uses its existing knowledge to construct targeted queries: regex searches for keywords, printing specific line ranges to inspect, narrowing the search space based on task understanding. It doesn't need to attend to 100K tokens to find the relevant 500. This is analogous to how humans skim a long document: prior knowledge guides where to look.
Recursive sub-calls defer unbounded reasoning chains. When the context requires reasoning over multiple chunks, the model spawns sub-RLM calls, each operating on a manageable portion. The decomposition is dynamic — the model decides how to partition based on what it observes, not a predefined chunking strategy.
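The recursive mechanism can be sketched roughly as follows. This is an illustration under stated assumptions, not the method from the RLM work: the names `rlm`, `split_on_lines`, and `toy_llm` are hypothetical, the model call is replaced by a keyword-matching stub, and a fixed halving split stands in for the dynamic, model-chosen partitioning described above.

```python
from typing import Callable

# Hypothetical interface for a model call: (query, text) -> answer.
LLM = Callable[[str, str], str]


def split_on_lines(text: str) -> tuple[str, str]:
    """Split text roughly in half at a line boundary."""
    lines = text.splitlines()
    mid = len(lines) // 2
    return "\n".join(lines[:mid]), "\n".join(lines[mid:])


def rlm(query: str, context: str, llm: LLM, max_chars: int = 200) -> str:
    """Answer `query` over `context`, recursing when the context is too large."""
    # Base case: small enough to hand to the model directly. A single
    # oversized line cannot be split further, so it is passed through too.
    if len(context) <= max_chars or "\n" not in context:
        return llm(query, context)
    # Assumption: fixed halving. A real RLM would let the model pick the split.
    left, right = split_on_lines(context)
    partials = [rlm(query, part, llm, max_chars) for part in (left, right)]
    # Combine step: one more call over the (usually much shorter) partial
    # answers. This sketch does not re-check their combined length.
    return llm(query, "\n".join(p for p in partials if p))


def toy_llm(query: str, text: str) -> str:
    """Stand-in for a model call: returns the lines that mention the query."""
    return "\n".join(line for line in text.splitlines() if query in line)


doc = "\n".join(f"row {i}: nothing to see" for i in range(500))
doc += "\nrow 500: the launch code is 0451"
answer = rlm("launch code", doc, toy_llm)
```

Each call to `toy_llm` sees at most a few hundred characters, yet `answer` reflects the whole document; swapping in a real model call would keep the same control flow.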
Results: RLMs handle inputs up to two orders of magnitude beyond model context windows. On shorter prompts (within context limits), RLMs still dramatically outperform base models and common long-context scaffolds, including context compaction. Per-query cost is comparable or lower because the model processes only the relevant portions of context rather than attending to everything.
This connects to "Can models precompute answers before users ask questions?" as a second reframing of compute allocation: sleep-time compute asks WHEN to compute (before vs. during the query); RLMs ask WHERE to keep the data (the model's context vs. an external environment). Both reject the default of "stuff everything into the context window at query time."
Inquiring lines that use this note as a source 17
This note is a source for these synthesized inquiries.
- How does token generation as flow differ from print's archival storage?
- What happens to anaphoric reference when context exceeds the window?
- How does prompt optimization differ from building persistent activation context?
- Can context windows and RAG actually change what language models generate?
- How do smaller models respond to longer reflection prompts?
- How do language agents implement prompts as executable computational graphs?
- Can algorithmic control flow over prompts simulate traditional programming languages?
- How does trajectory filtering handle noise when language models use code execution tools?
- Is prompt engineering a workaround rather than a capability fix?
- Why do longer context windows alone fail to capture temporal dynamics in dialogue?
- What makes extended chains more vulnerable than standard prompts?
- How does decomposed prompting formalize prompt libraries as reusable software modules?
- Why does sandboxed execution matter more than monolithic prompting?
- How does separating local and global context dependencies affect long-context performance?
- Why does teacher forcing fail to capture long-range dependencies?
- How do external prompt artifacts improve agent behavior compared to inline instructions?
- Why do LLMs degrade on long inputs before hitting context limits?
Related concepts in this collection 5
- Can models precompute answers before users ask questions?
Most LLM applications maintain persistent state across interactions. Could models use idle time between queries to precompute useful inferences about that context, reducing latency when users actually ask?
parallel reframing: sleep-time is temporal (when to compute), RLMs are spatial (where to keep data); both reject the default context-stuffing approach
- How do internal and external test-time scaling compare?
Explores whether test-time scaling approaches fundamentally differ in where compute is spent: during training (internal) versus at inference (external). Understanding this split clarifies the trade-offs in deployment strategy and reasoning capability.
RLMs are a novel form of external test-time scaling (TTS): compute is spent on environmental interaction rather than model-internal reasoning
- Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
RLMs address this directly by offloading context to environment; the model only attends to relevant fragments
- Can long-context models resolve retriever-reader imbalance?
Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense?
RLMs take the opposite approach: shift burden to retrieval (code-based context probing) rather than reading (attention over everything)
- When should AI systems do their thinking?
Most AI inference happens when users ask questions, but what if models could think during idle time instead? This explores whether shifting inference to before queries arrive could fundamentally change system design.
parallel temporal reframing: sleep-time asks WHEN to compute, RLMs ask WHERE to keep data; both reject the assumption that all processing must happen inside the context window at query time
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Recursive Language Models
- Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models
- Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Flows: Building Blocks of Reasoning and Collaborating AI
- ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
- MCP-Zero: Proactive Toolchain Construction for LLM Agents from Scratch
- Long-context LLMs Struggle with Long In-context Learning
- How Many Instructions Can LLMs Follow at Once?
Original note title
recursive language models treat long prompts as external environment enabling programmatic interaction 100x beyond context windows