DeepMind, Google’s AI research org, has unveiled a model that can generate an “endless” variety of playable 3D worlds.
Called Genie 2, the model — the successor to DeepMind’s Genie, which was released earlier this year — can generate an interactive, real-time scene from a single image and text description (e.g. “A cute humanoid robot in the woods”). In this way, it’s similar to models under development by Fei-Fei Li’s company, World Labs, and Israeli startup Decart.
DeepMind claims that Genie 2 can generate a “vast diversity of rich 3D worlds,” including worlds in which users can take actions like jumping and swimming by using a mouse or keyboard. Trained on videos, the model can simulate object interactions, animations, lighting, physics, reflections, and the behavior of non-player characters (“NPCs”).
Many of Genie 2’s simulations look like AAA video games — and the reason could well be that the model’s training data contains playthroughs of popular titles. But DeepMind, like many AI labs, wouldn’t reveal many details about its data sourcing methods, for competitive reasons or otherwise.
One wonders about the IP implications. DeepMind — being a Google subsidiary — has unfettered access to YouTube, and Google has previously implied that its ToS gives it permission to use YouTube videos for model training. But is Genie 2 basically creating unauthorized copies of the video games it “watched”? That’s for the courts to decide.
DeepMind says that Genie 2 can generate consistent worlds with different perspectives, like first-person and isometric views, for up to a minute, with the majority lasting 10-20 seconds.
“Genie 2 responds intelligently to actions taken by pressing keys on a keyboard, identifying the character and moving it correctly,” DeepMind wrote in a blog post. “For example, our model [can] figure out that arrow keys should move a robot and not trees or clouds.”
Most models like Genie 2 (world models, if you will) can simulate games and 3D environments, but they suffer from visual artifacts, inconsistency, and hallucinations. For example, Decart’s Minecraft simulator, Oasis, has a low resolution and quickly “forgets” the layout of levels.
Genie 2, however, can remember parts of a simulated scene that aren’t in view and render them accurately when they become visible again, DeepMind claims. (World Labs’ models can do this too.)
Now, games created with Genie 2 wouldn’t be all that fun, really. Having your progress erased every minute would drive anyone up the wall. So DeepMind’s positioning the model as more of a research and creative tool — a tool for prototyping “interactive experiences” and evaluating AI agents.
“Thanks to Genie 2’s out-of-distribution generalization capabilities, concept art and drawings can be turned into fully interactive environments,” DeepMind wrote. “And by using Genie 2 to quickly create rich and diverse environments for AI agents, our researchers can generate evaluation tasks that agents have not seen during training.”
DeepMind says that Genie 2 is still in its early stages, but the lab believes it will be a key component in developing the AI agents of the future.
Google has poured increasing resources into world models, which promise to be the next big thing in AI. In October, DeepMind hired Tim Brooks, who was heading development on OpenAI’s Sora video generator, to work on video generation technologies and world simulators.