Can simulation fidelity limit what agents learn from trained world models?
This explores whether the realism of a learned simulation (a 'world model' an agent trains inside) caps what the agent can actually learn — i.e., does the agent only ever learn what the simulator was good enough to show it?
The corpus says yes, and the cleaner way to see it is that fidelity limits are really *imagination* limits. The optimistic result is that language world models trained on millions of trajectories can scale agent learning past what real environments offer, even transferring across domains (see: Can language models learn to simulate agent environments?). But that promise inherits a hard constraint visible everywhere agents learn from a frozen artifact: agents trained on static expert demonstrations can't learn from their own failures or step outside what the curator imagined (see: Can agents learn beyond what their training data shows?). A world model is just a learned curator, so its blind spots become the agent's blind spots.
The sharpest fidelity failure isn't visual or physical realism; it's *what the simulator quietly does for you that the real world won't*. Social simulations look competent when one model puppets every character, then collapse the moment agents must act on private information they don't share, because the model was skipping the grounding work real situations demand (see: Why do LLMs fail when simulating agents with private information?). That's a fidelity gap masquerading as success: the agent learns from a world that's too cooperative, too omniscient, too smooth. The deeper diagnosis is that high prediction accuracy doesn't mean a coherent model of how things work. A system can nail next-state prediction through task-specific shortcuts while having no ability to reason about interventions or counterfactuals (see: What makes a world model actually useful for reasoning?). An agent trained against that kind of simulator learns the shortcuts, not the world.
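The gap between prediction accuracy and a usable world model can be made concrete with a toy example. The sketch below is illustrative only and is not drawn from any of the cited notes: the dynamics, the logging policy, and every name in it (`true_dynamics`, `logging_policy`, `shortcut_model`) are assumptions invented for the illustration. A predictor that ignores the action scores perfectly on logged data, because the logged action is always a function of the state, and then fails as soon as the action is set by intervention.

```python
def true_dynamics(state, action):
    # Hypothetical ground truth: the action really does change the next state.
    return state + action


def logging_policy(state):
    # Hypothetical data-collection policy: the action is fully determined by the state.
    return state % 2


def shortcut_model(state, action):
    # Ignores the action entirely and exploits the state/action correlation in the logs.
    return state + state % 2


def accuracy(model, transitions):
    # Fraction of (state, action, next_state) triples the model predicts exactly.
    hits = sum(model(s, a) == s_next for s, a, s_next in transitions)
    return hits / len(transitions)


states = range(100)

# Observational data: actions come from the logging policy.
observational = [
    (s, logging_policy(s), true_dynamics(s, logging_policy(s))) for s in states
]

# Interventional data: we force the opposite action in every state.
interventional = [
    (s, 1 - logging_policy(s), true_dynamics(s, 1 - logging_policy(s))) for s in states
]

print("shortcut, observational:", accuracy(shortcut_model, observational))
print("shortcut, interventional:", accuracy(shortcut_model, interventional))
print("true model, interventional:", accuracy(true_dynamics, interventional))
```

An agent trained inside `shortcut_model` would never see its actions matter, which is the toy version of learning the shortcuts rather than the world.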
There's a structural reason simulation gaps bite. Once a model is post-trained, it stops treating its outputs as passive predictions and starts treating them as actions that shape its own future inputs: it's running inside a closed action-perception loop (see: Do models recognize their own outputs as actions shaping future inputs?). If that loop is closed inside a flawed simulator, errors compound on the simulator's terms. And RL inside any fixed environment tends to *narrow* rather than broaden: policies collapse toward whatever the reward (or the simulator) makes easy, squeezing out exploration diversity, the same entropy collapse seen in reasoning and search agents (see: Does reinforcement learning squeeze exploration diversity in search agents?). So a limited world model doesn't just fail to teach new things; it actively funnels the agent into the simulator's comfortable regions.
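The narrowing effect described above can be seen even in a very small setting. The following is a minimal sketch, assuming a softmax policy over four actions in a fixed bandit and exact (noise-free) policy-gradient ascent; the reward values, step count, and learning rate are arbitrary assumptions, and this toy is not the training setup of any cited work. It only shows the qualitative pattern: entropy starts at its maximum and shrinks as probability mass piles onto whatever the fixed environment rewards most.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def train(rewards, steps=500, lr=0.5):
    # Exact gradient ascent on expected reward J = sum_i p_i * r_i.
    # For a softmax policy, dJ/dlogit_i = p_i * (r_i - J).
    logits = [0.0] * len(rewards)
    history = []  # policy entropy measured before each update
    for _ in range(steps):
        probs = softmax(logits)
        history.append(entropy(probs))
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [
            l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits), history


# Hypothetical fixed environment: two near-optimal actions, two poor ones.
final_probs, entropy_history = train([1.0, 0.9, 0.2, 0.1])
print("initial entropy:", entropy_history[0])
print("final entropy:", entropy(final_probs))
print("final policy:", final_probs)
```

Note that the second action is almost as good as the first, yet it is still squeezed out over time; nothing in the objective rewards keeping it around.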
The interesting twist is what the corpus offers as escape routes, and they mostly route *around* fidelity rather than chasing it. Instead of needing a more faithful simulator, agents can learn from unambiguous real feedback by storing verbal reflections in episodic memory. A binary success/failure signal is hard to rationalize away, so the agent improves without weight updates (see: Can agents learn from failure without updating their weights?). The same memory-as-learning move scales: treating adaptation as memory operations over real cases lets agents keep improving without touching model parameters at all (see: Can agents learn continuously from experience without updating weights?). More broadly, reliability comes from externalizing memory, skills, and protocols into a harness rather than from a better internal model of the world (see: Where does agent reliability actually come from?).
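The reflect-and-remember loop described above can be sketched in a few lines. This is a toy under stated assumptions, not the implementation from any cited system: the environment, the candidate actions, the `failed: ` note format, and all names (`EpisodicMemory`, `choose_action`, `run_episodes`) are hypothetical. The only things carried over from the text are the shape of the loop: a binary success/failure signal, reflections stored uncompressed in episodic memory, and improvement across episodes with no parameter updates.

```python
from dataclasses import dataclass, field

NOTE_PREFIX = "failed: "  # hypothetical format for a stored reflection


@dataclass
class EpisodicMemory:
    # Reflections are kept verbatim; nothing is summarized or compressed.
    reflections: list = field(default_factory=list)

    def add(self, note):
        self.reflections.append(note)


def environment(action):
    # Hypothetical task with an unambiguous binary outcome.
    return action == "use_key"


def choose_action(candidates, memory):
    # The "policy" is fixed; only the memory it reads from changes between episodes.
    ruled_out = {note[len(NOTE_PREFIX):] for note in memory.reflections}
    for action in candidates:
        if action not in ruled_out:
            return action
    return candidates[0]  # everything ruled out: fall back to the first option


def run_episodes(candidates, max_episodes):
    memory = EpisodicMemory()
    log = []
    for _ in range(max_episodes):
        action = choose_action(candidates, memory)
        success = environment(action)
        log.append((action, success))
        if success:
            break
        # Binary failure is hard to rationalize away: record exactly what failed.
        memory.add(NOTE_PREFIX + action)
    return log, memory


log, memory = run_episodes(["push_door", "pull_door", "use_key"], max_episodes=5)
print(log)
print(memory.reflections)
```

In a real agent the reflection would be free-form text written by the model and the policy would be the model reading that text, but the division of labor is the same: the feedback comes from the environment and the learning lives in memory.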
So the answer the corpus leaves you with is sharper than 'better simulators help.' Simulation fidelity does bound learning, but the binding constraint is the simulator's *imagination and convenience*, not its resolution. You raise the ceiling either by making the world model spend real computation on hard steps instead of memorizing easy ones (see: Can looped computation replace parameter count in world models?), or, more reliably, by letting the agent touch reality through cheap, hard-to-fake feedback signals and remember what happened. The thing you didn't know you wanted to know: agents will even repurpose the environment itself as memory when learning pressure is high enough (see: Do RL agents accidentally use environments as memory?), which is exactly the kind of grounded, situated learning a too-tidy simulator never lets them discover.
Sources (11 notes)
Qwen-AgentWorld demonstrates that native language world models trained via next-state prediction on 10M+ trajectories outperform real-environment training on three benchmarks and transfer across seven domains, positioning next-state prediction as a foundation objective for agents.
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
Research shows LLMs may achieve high prediction accuracy through task-specific heuristics without developing coherent generative models of how the world works. True world models must enable reasoning about interventions and counterfactuals, not surface regularities.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
AgentFly formalizes agent learning as a Memory-augmented MDP with three memory modules (case, subtask, tool) that enable credit assignment and policy improvement entirely through memory operations. The approach achieved 87.88% on GAIA validation without modifying LLM parameters.
Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.
LoopWM achieves up to 100x parameter efficiency by refining latent environment states through iterative computation in a shared block, with spectral-norm constraints providing formal stability guarantees. The approach mirrors physical system recurrence, spending more depth on harder prediction steps.
Mathematical proof shows that environmental artifacts reduce information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.