Open Challenges in World Models
In the last 12 months, several notable world models have been announced: Google DeepMind's Genie 3, World Labs' Marble 1.1, Meta's WorldGen, and more recently Tencent's HunyuanWorld 2.0 and Alibaba Group's Happy Oyster.
World models generate physical environments and predict how they evolve, a capability that is foundational for robotics, autonomous vehicles, simulation, and 3D and game generation.
What makes world generation hard isn't visual quality. It's three things that are still open areas of research:
- Occlusion: when an object moves behind another, does the model know it's still there?
- Persistence: turn the camera away and back. Is the scene the same?
- Action-conditioning: push a ball. Does it roll according to physics, or does the model hallucinate a plausible-looking result? This is the big one, and most shipping models only handle it for camera and character movement. True physical interaction (grasping, pushing, breaking) is still largely confined to research.
Benchmarks like Stanford University's WorldScore are emerging, but they mostly measure visual quality and camera trajectory adherence. The harder questions (does the model actually understand that objects persist? that physical interactions have consequences?) don't have agreed-upon metrics yet.
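To make the persistence question concrete, here is a minimal sketch of how such a probe might be framed: render a frame, turn the camera away and back to the same pose, render again, and compare the two. Everything in it is an assumption for illustration. The `persistence_score` function, the `Frame` type, and the use of mean absolute pixel difference are hypothetical and are not taken from WorldScore or any shipping model. Pixel agreement is also only a crude proxy: it says nothing about whether the model understands that objects persist.

```python
from typing import List

# Hypothetical frame type: a grayscale image as rows of pixel values in [0, 1].
Frame = List[List[float]]


def persistence_score(before: Frame, after: Frame) -> float:
    """Compare two frames rendered from the same camera pose.

    Returns 1.0 when the frames are identical and falls toward 0.0
    as they diverge (1 minus the mean absolute pixel difference).
    """
    same_shape = len(before) == len(after) and all(
        len(row_b) == len(row_a) for row_b, row_a in zip(before, after)
    )
    if not same_shape:
        raise ValueError("frames must have the same shape")

    total_diff = 0.0
    pixel_count = 0
    for row_b, row_a in zip(before, after):
        for pixel_b, pixel_a in zip(row_b, row_a):
            total_diff += abs(pixel_b - pixel_a)
            pixel_count += 1

    if pixel_count == 0:
        raise ValueError("frames must be non-empty")
    return 1.0 - total_diff / pixel_count


# Toy example: one pixel out of four changed completely after the camera returned.
frame_before = [[0.0, 1.0], [0.5, 0.5]]
frame_after = [[0.0, 0.0], [0.5, 0.5]]
score = persistence_score(frame_before, frame_after)
```

A score like this would penalize harmless changes (lighting flicker, small rendering noise) as heavily as a vanished object, which is part of why an agreed-upon metric is harder than it looks.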
We've seen this before. Text-to-image went through it starting in 2022, and text-to-video starting in 2024. The capability ships, but the evaluation vocabulary lags behind. The evaluation methods developed in the next few years will determine where and how far we can push the capabilities of world models.
Video credit: Google DeepMind / Genie 3