
The WoW world model generates high-quality, physically consistent robot-action videos in Out-of-Distribution (OOD) scenarios, enabling closed-loop correction and real-world robotic execution. The illustration shows the model's strong generalization across diverse tasks and environments.
Humans develop an understanding of intuitive physics through active interaction with the world. In stark contrast, current video models such as Sora rely solely on passive observation and therefore struggle to grasp physical causality. This motivates our central hypothesis: authentic physical intuition in a world model must be grounded in extensive, causally rich interactions with the real world. To test this, we introduce WoW, a 14B-parameter generative world model trained on 2 million real-world robot interaction trajectories. We find that the model's understanding of physics emerges as a probabilistic distribution over plausible outcomes, which can lead to stochastic instabilities and physical hallucinations. To mitigate these, we propose SOPHIA, a novel vision-language agent that evaluates the output of the DiT-based video generator and iteratively refines the language instructions to steer generation toward physical realism. Complementing this, a co-trained Inverse Dynamics Model translates refined plans into executable robotic actions, effectively closing the imagination-to-action loop. We further establish WoWBench, a new benchmark focused on physical consistency and causal reasoning in video, where WoW achieves state-of-the-art performance in both human and autonomous evaluations, excelling in physical causality, collision dynamics, and object permanence. Our work provides systematic evidence that large-scale, real-world interaction is essential for developing physical intuition in AI.
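The closed loop described above can be summarized in pseudocode. The sketch below is a minimal Python illustration under stated assumptions: the interfaces world_model.generate, critic.evaluate, critic.refine, and idm.infer_actions are hypothetical placeholders standing in for the DiT generator, the SOPHIA agent, and the Inverse Dynamics Model; none of these names come from a released codebase.

# Hypothetical sketch of the SOPHIA generate-evaluate-refine loop.
# All interfaces here are illustrative assumptions, not the paper's API.
from dataclasses import dataclass

@dataclass
class Rollout:
    video: object       # generated video (frames or tensor)
    score: float        # critic's physical-plausibility score
    instruction: str    # instruction that produced this video

def sophia_loop(world_model, critic, idm, instruction,
                threshold=0.8, max_iters=5):
    """Refine the language instruction until the generated video passes
    the critic's plausibility check, then decode the best video into
    executable robot actions via the inverse dynamics model."""
    best = None
    for _ in range(max_iters):
        video = world_model.generate(instruction)               # DiT video generation
        score, critique = critic.evaluate(video, instruction)   # VLM judges physics
        if best is None or score > best.score:
            best = Rollout(video, score, instruction)
        if score >= threshold:
            break
        # The critic's textual feedback steers the next generation attempt.
        instruction = critic.refine(instruction, critique)
    actions = idm.infer_actions(best.video)                     # video -> robot actions
    return actions, best

The key design point this sketch captures is that correction happens in language space: the critic never edits the video directly, it rewrites the instruction, and the world model regenerates from scratch.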
Zero-shot Daily Scenarios: Natural environments and daily-life activities, including workplace tasks and everyday interactions, demonstrating practical applicability.
Zero-shot Front-View Generation: Humanoid robots in diverse environments rendered from a novel front view, showcasing the model's ability to understand and predict complex interactions in dynamic settings.
Physics Simulation: WoW demonstrates a sophisticated understanding of physical phenomena, including object splitting, deformation, lighting effects, and adhesion, as well as counterfactual reasoning.
Counterfactual Imagination: Diverse counterfactual scenarios showcasing WoW's ability to imagine and generate physically plausible outcomes under altered conditions, such as different angles, magnetic properties, and energy effects.
Long-Horizon Tasks: WoW excels at generating extended sequences of robot actions, demonstrating its capability to plan and execute complex tasks over long time horizons with sustained physical consistency.
Artistic Interaction: WoW demonstrates understanding of 2D-3D relationships by enabling robotic hands to extract objects from famous paintings, showcasing cross-modal reasoning between visual art and physical manipulation.
@misc{chi2025wowworldomniscientworld,
  title={WoW: Towards a World omniscient World model Through Embodied Interaction},
  author={Xiaowei Chi and Peidong Jia and Chun-Kai Fan and Xiaozhu Ju and Weishi Mi and Kevin Zhang and Zhiyuan Qin and Wanxin Tian and Kuangzhi Ge and Hao Li and Zezhong Qian and Anthony Chen and Qiang Zhou and Yueru Jia and Jiaming Liu and Yong Dai and Qingpo Wuwu and Chengyu Bai and Yu-Kai Wang and Ying Li and Lizhang Chen and Yong Bao and Zhiyuan Jiang and Jiacheng Zhu and Kai Tang and Ruichuan An and Yulin Luo and Qiuxuan Feng and Siyuan Zhou and Chi-min Chan and Chengkai Hou and Wei Xue and Sirui Han and Yike Guo and Shanghang Zhang and Jian Tang},
  year={2025},
  eprint={2509.22642},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2509.22642},
}