SYNTHESIS NOTE

Can language models learn to simulate agent environments?

Explores whether training language models to predict next states across diverse agent domains can create transferable world models that improve agent performance beyond real-world interaction alone.

Synthesis note · 2026-06-27 · sourced from LLM Architecture

The agent-environment loop has two components: the policy (states → actions) and the world model ((states, actions) → next states). Qwen-AgentWorld's framing is that LLM-agent research has obsessed over the policy and almost entirely neglected the world model, and that this is a load-bearing gap; it cites Richens et al.'s result that any agent generalizing broadly must have learned a world model. The contribution is a native language world model (at 35B-A3B and 397B-A17B scales) that simulates agentic environments across seven domains via long chain-of-thought. It is trained on 10M+ real interaction trajectories through a three-stage "CPT injects, SFT activates, RL sharpens" recipe targeting next-state prediction.
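The two halves of the loop can be written down as a pair of function types. The sketch below is a toy illustration only: the names `Policy`, `WorldModel`, `rollout`, and the counter environment are hypothetical stand-ins, not Qwen-AgentWorld's interface.

```python
from typing import Callable, List, Tuple

State = str
Action = str

# Policy: state -> action
Policy = Callable[[State], Action]
# World model: (state, action) -> predicted next state
WorldModel = Callable[[State, Action], State]


def rollout(
    policy: Policy, world_model: WorldModel, start: State, steps: int
) -> List[Tuple[State, Action, State]]:
    """Run the agent-environment loop entirely inside the world model."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = policy(state)
        next_state = world_model(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


# Toy stand-ins (assumptions for illustration): a counter "environment"
# whose state is a number rendered as text.
def toy_policy(state: State) -> Action:
    return "increment" if int(state) < 3 else "stop"


def toy_world_model(state: State, action: Action) -> State:
    return str(int(state) + 1) if action == "increment" else state


trajectory = rollout(toy_policy, toy_world_model, start="0", steps=4)
```

In this framing, a language world model would take the place of `toy_world_model`: it receives the state and action as text and generates the next state as text, so the policy never needs to touch the real environment during the rollout.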

Two claims earn their keep. First, as a decoupled simulator, controllable simulation beats both uncontrolled simulation and, strikingly, real-environment training on three agentic benchmarks, suggesting a trained world model can manufacture cleaner, more targeted experience than reality supplies. Second, as a foundation, language world model (LWM) warm-up improves downstream agent performance across all seven tasks via cross-domain transfer, positioning next-state prediction as a transferable pretraining objective for agents the way next-token prediction is for language.

This refines two conceptual debates in this collection. It operationalizes "What should a world model actually be designed to do?", though notably Qwen-AgentWorld's mechanism is next-state prediction, so it either contradicts that essay's "not next observation" claim or shows that purposeful simulation can be built out of next-state prediction at sufficient scale and reasoning depth. It also bears on "Do LLMs actually have world models or just facts?": a model that controllably simulates state transitions is reaching for the mechanistic (Sense 2) reading, but benchmark success on seven curated domains cannot distinguish a compact generative model of mechanisms from sophisticated interpolation over 10M trajectories. The strongest counterargument is the simulation-fidelity ceiling: an agent trained on simulated rather than real experience inherits the world model's errors, and "surpassing real-environment training" may hold only where the simulator's blind spots happen not to matter.
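The simulation-fidelity ceiling can be made concrete with a toy example. The sketch below is not drawn from the source; `real_step`, `simulated_step`, and `evaluate` are hypothetical names, and the dynamics are invented purely to show how a simulator blind spot can mislead action selection.

```python
def real_step(state: int, action: str) -> int:
    """Real dynamics: the "risky" action resets progress once state reaches 2."""
    if action == "safe":
        return state + 1
    return state + 2 if state < 2 else 0


def simulated_step(state: int, action: str) -> int:
    """Simulator with a blind spot: it never models the reset."""
    if action == "safe":
        return state + 1
    return state + 2


def evaluate(step, action: str, horizon: int = 4) -> int:
    """Final state after repeating one action for `horizon` steps from state 0."""
    state = 0
    for _ in range(horizon):
        state = step(state, action)
    return state


actions = ["safe", "risky"]
# Choose the action that looks best inside the simulator...
best_in_sim = max(actions, key=lambda a: evaluate(simulated_step, a))
# ...then measure what it actually earns in the real environment.
real_return = evaluate(real_step, best_in_sim)
real_best = max(evaluate(real_step, a) for a in actions)
```

Here the simulator prefers "risky" because it never sees the reset, while the real environment rewards "safe". If the real dynamics had no reset, the blind spot would not matter and the simulator's choice would match reality, which is the condition under which a "surpassing real-environment training" result could hold.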

Related concepts in this collection 3

What should a world model actually be designed to do?
Current AI research treats world models as either video predictors or RL dynamics learners, but what if their real purpose is simulating actionable possibilities for decision-making rather than predicting next observations?
contrasts/refines: builds purposeful simulation out of next-state prediction, complicating the essay's "not next observation" framing

Do LLMs actually have world models or just facts?
The term 'world model' conflates two different capabilities: factual representation versus mechanistic understanding. Understanding which one LLMs actually possess matters for assessing their reasoning reliability.
exemplifies: the attempt to reach the mechanistic (Sense 2) reading, without proving it over interpolation

What five design choices compose a world model?
World models are often presented as monolithic systems, but they actually involve five distinct design decisions (data preparation, representation, reasoning architecture, training objective, and decision integration) that can each fail independently. Understanding this decomposition helps diagnose why world model proposals fall short.
grounds: the three-stage recipe instantiates the data/objective/integration design choices for a language world model
