
The WoW world model generates high-quality, physically consistent robot-action videos in Out-of-Distribution (OOD) scenarios, enabling closed-loop correction and real-world robotic execution. The illustration shows the model's strong generalization across diverse tasks and environments.
Humans develop an understanding of intuitive physics through active interaction with the world. In stark contrast, current video models such as Sora rely solely on passive observation and therefore struggle to grasp physical causality. This motivates our central hypothesis: authentic physical intuition in a world model must be grounded in extensive, causally rich interactions with the real world. To test this, we introduce WoW, a 14B-parameter generative world model trained on 2 million real-world robot interaction trajectories. We find that the model's understanding of physics emerges as a probabilistic distribution over plausible outcomes, which can lead to stochastic instabilities and physical hallucinations. To mitigate these, we propose SOPHIA, a novel vision-language agent that evaluates the output of the DiT-based video generator and iteratively refines the language instructions to steer generation toward physical realism. Complementing this, a co-trained Inverse Dynamics Model translates refined plans into executable robotic actions, effectively closing the imagination-to-action loop. We further establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluations, excelling in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is essential for developing physical intuition in AI.
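The closed loop described above can be summarized in pseudocode. The sketch below is a minimal Python illustration under stated assumptions: the interfaces world_model.generate, critic.evaluate, critic.refine, and idm.infer_actions are hypothetical placeholders standing in for the DiT generator, the SOPHIA agent, and the Inverse Dynamics Model; none of these names come from a released codebase.

# Hypothetical sketch of the SOPHIA generate-evaluate-refine loop.
# All interfaces here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Rollout:
    video: object       # generated video (frames or tensor)
    score: float        # critic's physical-plausibility score
    instruction: str    # instruction that produced this video

def sophia_loop(world_model, critic, idm, instruction,
                threshold=0.8, max_iters=5):
    """Refine the language instruction until the generated video passes
    the critic's plausibility check, then decode the best video into
    executable robot actions via the inverse dynamics model."""
    best = None
    for _ in range(max_iters):
        video = world_model.generate(instruction)               # DiT video generation
        score, critique = critic.evaluate(video, instruction)   # VLM judges physics
        if best is None or score > best.score:
            best = Rollout(video, score, instruction)
        if score >= threshold:
            break
        # The critic's textual feedback steers the next generation attempt.
        instruction = critic.refine(instruction, critique)
    actions = idm.infer_actions(best.video)                     # video -> robot actions
    return actions, best

The key design point this sketch captures is that correction happens in language space: the critic never edits the video directly, it rewrites the instruction, and the world model regenerates from scratch.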
Zero-shot Daily Scenarios: Natural environments and daily-life activities, including workplace tasks and everyday interactions, demonstrating practical applicability.
Zero-shot Front-View Generation: Humanoid robots in diverse environments rendered from a novel front view, showcasing the model's ability to understand and predict complex interactions in dynamic settings.
Physics Simulation: WoW demonstrates a sophisticated understanding of physical phenomena, including object splitting, deformation, lighting effects, and adhesion, as well as counterfactual reasoning.
Counterfactual Imagination: Diverse counterfactual scenarios showcasing WoW's ability to imagine and generate physically plausible outcomes under altered conditions, such as different angles, magnetic properties, and energy effects.
Long-Horizon Tasks: WoW excels at generating extended sequences of robot actions, demonstrating its capability to plan and execute complex tasks over long time horizons with sustained physical consistency.
Artistic Interaction: WoW demonstrates understanding of 2D-3D relationships by enabling robotic hands to extract objects from famous paintings, showcasing cross-modal reasoning between visual art and physical manipulation.
@misc{chi2025wowworldomniscientworld,
  title={WoW: Towards a World omniscient World model Through Embodied Interaction},
  author={Xiaowei Chi and Peidong Jia and Chun-Kai Fan and Xiaozhu Ju and Weishi Mi and Kevin Zhang and Zhiyuan Qin and Wanxin Tian and Kuangzhi Ge and Hao Li and Zezhong Qian and Anthony Chen and Qiang Zhou and Yueru Jia and Jiaming Liu and Yong Dai and Qingpo Wuwu and Chengyu Bai and Yu-Kai Wang and Ying Li and Lizhang Chen and Yong Bao and Zhiyuan Jiang and Jiacheng Zhu and Kai Tang and Ruichuan An and Yulin Luo and Qiuxuan Feng and Siyuan Zhou and Chi-min Chan and Chengkai Hou and Wei Xue and Sirui Han and Yike Guo and Shanghang Zhang and Jian Tang},
  year={2025},
  eprint={2509.22642},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.22642},
}