Can language models learn to simulate agent environments?
Explores whether training language models to predict next states across diverse agent domains can create transferable world models that improve agent performance beyond real-world interaction alone.
The agent-environment loop has two components: the policy, which maps states to actions, and the world model, which maps (state, action) pairs to next states. Qwen-AgentWorld's framing is that LLM-agent research has concentrated on the policy and almost entirely neglected the world model, and that this is a load-bearing gap; it cites Richens et al.'s result that any agent generalizing broadly must have learned a world model. The contribution is a native language world model (35B-A3B and 397B-A17B) that simulates agentic environments across seven domains via long chain-of-thought. It is trained on 10M+ real interaction trajectories through a three-stage recipe, "CPT injects, SFT activates, RL sharpens," with next-state prediction as the training target.
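The two halves of the loop can be written down as a pair of function types. The sketch below is illustrative only: `rollout`, `toy_policy`, and `toy_world_model` are hypothetical names, and the counter environment is a stand-in that has nothing to do with Qwen-AgentWorld's actual domains or interfaces.

```python
from typing import Callable, List, Tuple

# Assumption for illustration: states and actions are plain strings,
# as they would be for a text-based (language) world model.
State = str
Action = str
Policy = Callable[[State], Action]              # states -> actions
WorldModel = Callable[[State, Action], State]   # (states, actions) -> next states


def rollout(policy: Policy, world_model: WorldModel,
            start: State, steps: int) -> List[Tuple[State, Action, State]]:
    """Run the loop entirely in simulation: the world model, not a real
    environment, supplies every next state."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = policy(state)
        next_state = world_model(state, action)
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


# Hypothetical toy environment: a counter that the policy increments up to 3.
def toy_policy(state: State) -> Action:
    return "increment" if int(state) < 3 else "stop"


def toy_world_model(state: State, action: Action) -> State:
    return str(int(state) + 1) if action == "increment" else state


traj = rollout(toy_policy, toy_world_model, "0", 5)
```

Swapping `toy_world_model` for a learned next-state predictor is what turns this loop into training on simulated experience; the policy code does not change.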
Two claims stand out. First, used as a decoupled simulator, the model supports controllable simulation that beats both uncontrolled simulation and, strikingly, real-environment training on three agentic benchmarks, which suggests a trained world model can manufacture cleaner, more targeted experience than reality supplies. Second, used as a foundation, language world model (LWM) warm-up improves downstream agent performance across all seven tasks via cross-domain transfer, positioning next-state prediction as a transferable pretraining objective for agents, much as next-token prediction is for language.
These results refine two conceptual debates in this collection. They operationalize "What should a world model actually be designed to do?", though notably Qwen-AgentWorld's mechanism is next-state prediction, so it either contradicts that essay's "not next observation" claim or shows that purposeful simulation can be built out of next-state prediction at sufficient scale and reasoning depth. They also bear on "Do LLMs actually have world models or just facts?": a model that controllably simulates state transitions is reaching for the mechanistic (Sense 2) reading, but benchmark success on seven curated domains does not distinguish a compact generative model of mechanisms from sophisticated interpolation over 10M trajectories. The strongest counterargument is the simulation-fidelity ceiling: an agent trained on simulated rather than real experience inherits the world model's errors, and "surpassing real-environment training" may hold only where the simulator's blind spots happen not to matter.
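The fidelity-ceiling argument can be made concrete with a toy comparison. Everything in the sketch below is a hypothetical construction, not taken from the paper: `real_step` is an invented ground-truth environment, `sim_step` is an invented simulator that never learned the `reset` action, and `agreement` measures how often the two agree along a given action sequence.

```python
from typing import List


def real_step(state: int, action: str) -> int:
    """Hypothetical ground-truth environment: a counter with a reset."""
    if action == "inc":
        return state + 1
    if action == "reset":
        return 0
    return state


def sim_step(state: int, action: str) -> int:
    """Hypothetical simulator with a blind spot: it models "inc" correctly
    but treats "reset" as a no-op."""
    if action == "inc":
        return state + 1
    return state


def agreement(actions: List[str], start: int = 0) -> float:
    """Fraction of steps on which the simulator predicts the real next state,
    with both fed the real current state at every step."""
    state = start
    hits = 0
    for action in actions:
        real_next = real_step(state, action)
        hits += sim_step(state, action) == real_next
        state = real_next
    return hits / len(actions)


easy = agreement(["inc", "inc", "inc"])             # never touches the blind spot
hard = agreement(["inc", "reset", "inc", "reset"])  # exercises the blind spot
```

Here `easy` is 1.0 and `hard` is 0.5: the same simulator looks perfect on a task that never needs `reset` and is wrong half the time on one that does. A benchmark made up only of the first kind of task would not reveal the ceiling.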
Inquiring lines that use this note as a source (5)
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Does next-state prediction alone build mechanistic world models or just sophisticated interpolation?
- Can simulation fidelity limit what agents learn from trained world models?
- Why has agent research prioritized policy over world model development?
- Does iterative computation for reasoning transfer to environment dynamics modeling?
- How do spectral-norm constraints prevent divergence in world model rollouts?
Related concepts in this collection (3)
- What should a world model actually be designed to do?
  Current AI research treats world models as either video predictors or RL dynamics learners, but what if their real purpose is simulating actionable possibilities for decision-making rather than predicting next observations?
  contrasts/refines: builds purposeful simulation out of next-state prediction, complicating the essay's "not next observation" framing
- Do LLMs actually have world models or just facts?
  The term 'world model' conflates two different capabilities: factual representation versus mechanistic understanding. Understanding which one LLMs actually possess matters for assessing their reasoning reliability.
  exemplifies: the attempt to reach the mechanistic (Sense 2) reading, without proving it over interpolation
- What five design choices compose a world model?
  World models are often presented as monolithic systems, but they actually involve five distinct design decisions (data preparation, representation, reasoning architecture, training objective, and decision integration) that can each fail independently. Understanding this decomposition helps diagnose why world model proposals fall short.
  grounds: the three-stage recipe instantiates the data/objective/integration design choices for a language world model
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Qwen-AgentWorld: Language World Models for General Agents
- Agent Learning via Early Experience
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks
- Nex-N1: Agentic Models Trained via a Unified Ecosystem for Large-Scale Environment Construction
- Can Language Models Serve as Text-Based World Simulators?
- Survey on Evaluation of LLM-based Agents
- A Survey of Self-Evolving Agents: On Path to Artificial Super Intelligence
- Training-Free Group Relative Policy Optimization
Original note title
a native language world model supplies the missing half of the agent loop — and trained simulation can scale agents beyond real-environment interaction