SYNTHESIS NOTE

Can language models learn to simulate agent environments?

Explores whether training language models to predict next states across diverse agent domains can create transferable world models that improve agent performance beyond real-world interaction alone.

Synthesis note · 2026-06-27 · sourced from LLM Architecture

The agent-environment loop has two components: the policy (states → actions) and the world model ((states, actions) → next states). Qwen-AgentWorld's framing is that LLM-agent research has obsessed over the policy and almost entirely neglected the world model — and that this is a load-bearing gap, citing Richens et al.'s result that any agent generalizing broadly must have learned a world model. The contribution is a native language world model (35B-A3B and 397B-A17B) that simulates agentic environments across seven domains via long chain-of-thought, trained on 10M+ real interaction trajectories through a three-stage "CPT injects, SFT activates, RL sharpens" recipe targeting next-state prediction.
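The two halves of the loop can be made concrete with a minimal sketch. This is an illustration of the interfaces only, not Qwen-AgentWorld's implementation: the type aliases, the `rollout` helper, and the toy counter environment are all hypothetical names introduced here, and states and actions are assumed to be plain strings.

```python
from typing import Callable, List, Tuple

# Assumption for illustration: states and actions are plain text.
State = str
Action = str

# The policy maps a state to an action.
Policy = Callable[[State], Action]
# The world model maps (state, action) to a predicted next state.
WorldModel = Callable[[State, Action], State]


def rollout(policy: Policy, world_model: WorldModel,
            start: State, steps: int) -> List[Tuple[State, Action, State]]:
    """Run the agent loop entirely inside the world model (no real environment)."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = policy(state)
        next_state = world_model(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


# Toy stand-ins (hypothetical): the state is a counter written as text.
def toy_policy(state: State) -> Action:
    return "increment" if int(state) < 3 else "stop"


def toy_world_model(state: State, action: Action) -> State:
    return str(int(state) + 1) if action == "increment" else state


if __name__ == "__main__":
    for transition in rollout(toy_policy, toy_world_model, "0", 5):
        print(transition)
```

In this framing, the policy-focused research the note describes improves `toy_policy`, while a language world model would stand in for `toy_world_model`, with an LLM generating the next state instead of a hand-written rule.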

Two claims earn their keep. First, used as a decoupled simulator, the model's controllable simulation beats both uncontrolled simulation and, strikingly, real-environment training on three agentic benchmarks, suggesting a trained world model can manufacture cleaner, more targeted experience than reality supplies. Second, used as a foundation, language world model (LWM) warm-up improves downstream agent performance across all seven tasks via cross-domain transfer, positioning next-state prediction as a transferable pretraining objective for agents the way next-token prediction is for language.

This refines two conceptual debates. It operationalizes "What should a world model actually be designed to do?", though Qwen-AgentWorld's mechanism is next-state prediction, so it either contradicts that essay's "not next observation" claim or shows that purposeful simulation can be built out of next-state prediction at sufficient scale and reasoning depth. It also bears on "Do LLMs actually have world models or just facts?": a model that controllably simulates state transitions is reaching for the mechanistic (Sense 2) reading, but benchmark success on seven curated domains cannot distinguish a compact generative model of mechanisms from sophisticated interpolation over 10M trajectories. The strongest counterargument is the simulation-fidelity ceiling: an agent trained on simulated rather than real experience inherits the world model's errors, and "surpassing real-environment training" may hold only where the simulator's blind spots happen not to matter.
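The simulation-fidelity ceiling can be illustrated with a toy example. Everything here is a hypothetical construction for illustration, not a result from the source: `real_env`, `simulator`, the two policies, and the goal of reaching state 4 are invented, and the simulator is given one deliberate blind spot (it believes a shortcut action always works).

```python
from typing import Callable, Optional

Transition = Callable[[int, str], int]
Policy = Callable[[int], str]


def real_env(state: int, action: str) -> int:
    """Hypothetical ground truth: the shortcut resets to 0 when taken from state 2."""
    if action == "step":
        return state + 1
    return 0 if state == 2 else state + 2


def simulator(state: int, action: str) -> int:
    """Hypothetical learned model with a blind spot: it thinks the shortcut always works."""
    if action == "step":
        return state + 1
    return state + 2


def steps_to_goal(policy: Policy, transition: Transition,
                  goal: int = 4, max_steps: int = 20) -> Optional[int]:
    """Number of actions needed to reach the goal, or None if it is never reached."""
    state = 0
    for t in range(max_steps):
        if state >= goal:
            return t
        state = transition(state, policy(state))
    return max_steps if state >= goal else None


def always_shortcut(state: int) -> str:
    return "shortcut"


def always_step(state: int) -> str:
    return "step"


if __name__ == "__main__":
    for policy in (always_shortcut, always_step):
        print(policy.__name__,
              "simulated:", steps_to_goal(policy, simulator),
              "real:", steps_to_goal(policy, real_env))
```

Judged inside the simulator, `always_shortcut` looks best (2 actions against 4), yet in the real environment it loops between states 0 and 2 and never reaches the goal. `always_step` never touches the blind spot, so its simulated and real results agree, which is the sense in which simulated training can match or beat real training only where the simulator's errors do not matter.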

Original note title

a native language world model supplies the missing half of the agent loop — and trained simulation can scale agents beyond real-environment interaction