Why has agent research prioritized policy over world model development?
This explores why the field has poured effort into teaching agents what action to take next (the policy) while largely neglecting the agent's internal model of how its environment will respond (the world model) — and what that imbalance costs.
This reads the question as being about a structural lopsidedness in agent research: we have spent enormously on learning the *policy* — the function that picks the next action — and comparatively little on the *world model*, the agent's ability to simulate what happens if it acts. The corpus suggests this isn't an accident of taste so much as a consequence of how the field trains and grades agents.
The training side explains a lot. Most agents are built by imitating expert demonstrations, and demonstrations only ever encode actions, not consequences, so the agent inherits a policy and never has to build a model of the world that produced those actions. The cost is a hard ceiling: agents trained on static expert data can't learn from their own failures or generalize past what the curator imagined (see "Can agents learn beyond what their training data shows?"). One recent line of work frames the world model as literally "the missing half of the agent loop": it trains a native language world model on more than 10 million trajectories via next-state prediction, and reports that it outperforms real-environment training on three benchmarks and transfers across seven domains (see "Can language models learn to simulate agent environments?"). The very phrasing "missing half" tells you which half the field built first.
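To make the contrast between the two training objectives concrete, here is a minimal sketch of how one logged trajectory could be turned into examples for each. The `Step` record and both helper functions are hypothetical names chosen for illustration; they are not taken from the cited work, and the string states stand in for whatever observation format a real system would use.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """One logged transition (hypothetical record format)."""
    state: str       # observation before acting
    action: str      # action that was taken
    next_state: str  # what the environment returned afterwards


def policy_examples(traj: List[Step]) -> List[Tuple[str, str]]:
    """Imitation-style targets: map a state to the action taken in it."""
    return [(s.state, s.action) for s in traj]


def world_model_examples(traj: List[Step]) -> List[Tuple[Tuple[str, str], str]]:
    """Next-state prediction targets: map (state, action) to the resulting state."""
    return [((s.state, s.action), s.next_state) for s in traj]


trajectory = [
    Step("login page", "submit credentials", "dashboard"),
    Step("dashboard", "open settings", "settings page"),
]

for example in policy_examples(trajectory):
    print("policy:", example)
for example in world_model_examples(trajectory):
    print("world model:", example)
```

The policy examples never mention `next_state`, which is the sense in which an imitation-trained agent can fit its targets without modeling consequences. The world-model examples make the consequence the prediction target.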
The grading side reinforces it. Benchmarks reward policies that clear contests, not agents that model a messy world, and the field optimizes what it measures. The benchmark-to-GDP gap shows agents acing abstract tasks while failing real long-horizon work, because the tests measured contests rather than the open-ended environments where a world model would earn its keep (see "Why do agent benchmarks not predict real economic value?"). When you only score the action, you never have to pay for the simulation behind it.
There's also a deeper reason world models got shortchanged: they're easy to fake. A model can hit high prediction accuracy through task-specific heuristics without ever building a coherent generative model of how the world works (see "What makes a world model actually useful for reasoning?"). A genuinely useful world model isn't a next-frame predictor; it has to simulate *actionable possibilities*, the counterfactual and intervention space an agent reasons over before choosing (see "What should a world model actually be designed to do?"). That's a much harder target than fitting a policy, so the field took the tractable path. The failure shows up sharply under information asymmetry: LLMs look socially competent when one model secretly controls everyone, but collapse when agents hold private information, which exposes the apparent competence as policy mimicry that skipped the world-modeling work (see "Why do LLMs fail when simulating agents with private information?").
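As a rough illustration of what "simulating actionable possibilities before choosing" can mean in code, the sketch below picks an action by asking a world model what each candidate action would lead to and scoring the predicted outcome. Everything here is an assumption made for illustration: `choose_by_simulation`, `simulate`, and `value` are hypothetical names, and the toy number-line environment stands in for a learned model.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")  # state type
A = TypeVar("A")  # action type


def choose_by_simulation(
    state: S,
    actions: Iterable[A],
    simulate: Callable[[S, A], S],  # world model: predicted next state
    value: Callable[[S], float],    # how desirable a predicted state is
) -> A:
    """One-step lookahead: pick the action whose simulated outcome scores best."""
    return max(actions, key=lambda a: value(simulate(state, a)))


# Toy setting (assumed): the state is a position on a number line,
# actions move it by -1, 0, or +1, and the goal is position 5.
def toy_simulate(position: int, move: int) -> int:
    return position + move


def toy_value(position: int) -> float:
    return -abs(position - 5)


chosen = choose_by_simulation(3, [-1, 0, 1], toy_simulate, toy_value)
print(chosen)
```

A pure policy would map the position straight to a move. Here the move is derived from counterfactual queries to `simulate`, so a heuristic that only matches surface regularities would show up as bad choices as soon as the goal or dynamics change.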
What the curious reader might not expect is where the corpus thinks the fix actually lives. Reliability in long-horizon tasks turns out to track *persistence in feedback loops* (repeatedly acting, observing, and incorporating results) more than initial answer quality (see "What predicts success in ultra-long-horizon agent tasks?"). And reliable agents get there less by scaling the model and more by externalizing memory, skills, and protocols into a surrounding harness (see "Where does agent reliability actually come from?"). In other words, the world model the field underbuilt is quietly being reconstructed outside the model, in the loop and the scaffolding, rather than learned inside the policy.
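The act, observe, incorporate cycle described above can be sketched as a small loop in which results are stored outside the proposer and fed back into the next attempt. This is a toy under stated assumptions, not the setup used in the cited notes: `run_feedback_loop`, `propose`, and `evaluate` are hypothetical names, the candidates are integers, and the external `memory` list plays the role of the harness.

```python
from typing import Callable, List, Optional, Tuple

Record = Tuple[int, float]  # (candidate, observed score)


def run_feedback_loop(
    propose: Callable[[List[Record]], int],  # acts, conditioned on past results
    evaluate: Callable[[int], float],        # observes the outcome of acting
    budget: int,                             # number of cycles allowed
) -> Tuple[Optional[int], float, List[Record]]:
    """Repeat act -> observe -> incorporate, keeping state in external memory."""
    memory: List[Record] = []  # lives outside the proposer (the "harness")
    best: Optional[int] = None
    best_score = float("-inf")
    for _ in range(budget):
        candidate = propose(memory)         # act
        score = evaluate(candidate)         # observe
        memory.append((candidate, score))   # incorporate
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score, memory


def toy_propose(memory: List[Record]) -> int:
    """Start at 0, then try one step past the best candidate seen so far."""
    if not memory:
        return 0
    return max(memory, key=lambda record: record[1])[0] + 1


def toy_evaluate(x: int) -> float:
    """Assumed objective with its optimum at x = 3."""
    return -float((x - 3) ** 2)


one_shot = run_feedback_loop(toy_propose, toy_evaluate, budget=1)
persistent = run_feedback_loop(toy_propose, toy_evaluate, budget=6)
print(one_shot[:2], persistent[:2])
```

The proposer is identical in both runs; only the number of cycles differs. The one-shot run stops at its first guess, while the persistent run reaches the optimum because each observation is written to memory and shapes the next proposal.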
Sources (8 notes)
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Qwen-AgentWorld demonstrates that native language world models trained via next-state prediction on 10M+ trajectories outperform real-environment training on three benchmarks and transfer across seven domains, positioning next-state prediction as a foundation objective for agents.
ALE's analysis of 960 real occupational workflows shows agents excel at abstract contests but fail long-horizon professional tasks. The gap is not model capability but benchmark design—the field optimizes what it measures, and it has measured contests rather than work.
Research shows LLMs may achieve high prediction accuracy through task-specific heuristics without developing coherent generative models of how the world works. True world models must enable reasoning about interventions and counterfactuals, not surface regularities.
Drawing on hypothetical thinking in psychology, world models are most useful when designed to simulate all actionable possibility spaces—physical, embodied, emotional, social, mental, counterfactual, and evolutionary—grounded in agent decision-making rather than passive prediction.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
Across 17 frontier models on 36 expert-curated optimization tasks, repeated benchmark-edit-incorporate cycles within a wall-clock budget proved the dominant success predictor. Most models terminated early or burned budget unproductively; Claude Opus 4.6 stood out as persistent.
Research shows reliable LLM agents externalize three cognitive burdens (memory for state persistence, skills as procedural components, and protocols for structured interaction) into a harness layer rather than relying on model scale alone. The harness unifies these externalized functions and spares the model from solving the same problems repeatedly.