WoW
Towards a World-omniscient World-model Through Embodied Interaction
Xiaowei Chi1,2,3, Peidong Jia1,2, Chun-Kai Fan1,2, Xiaozhu Ju1, Weishi Mi1, Kevin Zhang2, Zhiyuan Qin1, Wanxin Tian1, Kuangzhi Ge2, Hao Li1, Zezhong Qian1,2, Anthony Chen2, Qiang Zhou1,2, Yueru Jia2, Jiaming Liu2, Yong Dai1, Qingpo Wuwu2, Chengyu Bai2, Yu-Kai Wang2, Ying Li2, Lizhang Chen1,2, Yong Bao1, Zhiyuan Jiang1, Jiacheng Zhu1, Kai Tang2, Ruichuan An2, Yulin Luo2, Qiuxuan Feng1,2, Siyuan Zhou3, Chi-min Chan3, Chengkai Hou1,2, Wei Xue3, Sirui Han3, Yike Guo3, Shanghang Zhang2, Jian Tang1
1 Beijing Innovation Center of Humanoid Robotics
2 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University
3 Hong Kong University of Science and Technology
Abstract.

The WoW world model generates high-quality, physically consistent videos of robot actions in out-of-distribution (OOD) scenarios, enabling closed-loop correction and real-world robotic execution. The illustration shows the model's strong generalization across diverse tasks and environments.

Humans develop an understanding of intuitive physics through active interaction with the world. In stark contrast, current video models such as Sora rely solely on passive observation and therefore struggle to grasp physical causality. This motivates our central hypothesis: authentic physical intuition in a world model must be grounded in extensive, causally rich interaction with the real world. To test this hypothesis, we introduce WoW, a 14B-parameter generative world model trained on 2 million real-world robot interaction trajectories. We find that the model's understanding of physics emerges as a probabilistic distribution over plausible outcomes, which can lead to stochastic instabilities and physical hallucinations. To mitigate these failures, we propose SOPHIA, a vision-language agent that evaluates the output of the DiT-based generator and iteratively refines the language instruction to steer generation toward physical realism. Complementing this, a co-trained Inverse Dynamics Model translates the refined video plans into executable robot actions, closing the imagination-to-action loop. We further establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, on which WoW achieves state-of-the-art performance in both human and automated evaluation, excelling at physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is essential for developing physical intuition in AI.
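The abstract describes a generate-evaluate-refine loop: the world model imagines a video rollout from an instruction, SOPHIA critiques the rollout for physical plausibility and rewrites the instruction if needed, and the inverse dynamics model decodes an accepted rollout into robot actions. The following is a minimal Python sketch of that control flow; every interface in it (world_model.generate, sophia.critique, sophia.refine_instruction, idm.infer_actions, and the Feedback container) is a hypothetical stand-in for illustration, not the released API.

    from dataclasses import dataclass

    @dataclass
    class Feedback:
        # Hypothetical critic output, not the paper's actual data structure.
        is_physically_plausible: bool
        critique: str  # natural-language explanation of any physics violation

    def imagine_and_act(world_model, sophia, idm, observation, instruction,
                        max_refinements=3):
        """Refine the instruction until the imagined rollout passes SOPHIA's
        physics check, then decode the rollout into executable actions."""
        for _ in range(max_refinements):
            # 1. The DiT world model imagines a video rollout for the instruction.
            video = world_model.generate(observation, instruction)

            # 2. SOPHIA, a vision-language critic, judges physical realism.
            feedback = sophia.critique(video, instruction)
            if feedback.is_physically_plausible:
                # 3. The co-trained inverse dynamics model turns the accepted
                #    video plan into a robot action sequence.
                return idm.infer_actions(video)

            # 4. Otherwise, rewrite the instruction from the critique and retry,
            #    steering generation toward physical realism.
            instruction = sophia.refine_instruction(instruction, feedback)

        raise RuntimeError("No physically plausible rollout within the refinement budget.")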

Real-world Scenarios.
Robot Manipulation.
Physics Simulation.
Counterfactual Imagination.

Counterfactual Imagination: Diverse counterfactual scenarios showcasing WoW's ability to imagine and generate physically plausible outcomes under altered conditions, such as different angles, magnetic properties, and energy effects.

Long Horizon Tasks.
Artistic Interaction.
More Results & Open-Source Plan.
BibTeX
@misc{chi2025wowworldomniscientworld,
      title={WoW: Towards a World omniscient World model Through Embodied Interaction}, 
      author={Xiaowei Chi and Peidong Jia and Chun-Kai Fan and Xiaozhu Ju and Weishi Mi and Kevin Zhang and Zhiyuan Qin and Wanxin Tian and Kuangzhi Ge and Hao Li and Zezhong Qian and Anthony Chen and Qiang Zhou and Yueru Jia and Jiaming Liu and Yong Dai and Qingpo Wuwu and Chengyu Bai and Yu-Kai Wang and Ying Li and Lizhang Chen and Yong Bao and Zhiyuan Jiang and Jiacheng Zhu and Kai Tang and Ruichuan An and Yulin Luo and Qiuxuan Feng and Siyuan Zhou and Chi-min Chan and Chengkai Hou and Wei Xue and Sirui Han and Yike Guo and Shanghang Zhang and Jian Tang},
      year={2025},
      eprint={2509.22642},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2509.22642}, 
}