DeepMind Is Simulating Entire Worlds - What They're Building Will Shock You

Below is a short summary and detailed review of this video written by FutureFactual:

World Models, Genie and the Next Step Toward Embodied AI

Podcast overview

In this edition of The World, the Universe and Us from New Scientist, host Josh Hajigo speaks with Google DeepMind engineer Jack Parker-Holder about world models, Genie, and the push toward embodied AI. The conversation spans definitions, demonstrations, and real-world applications, including collaborations with Waymo and the deployment of humanoid robots.

Key insights

  • World models simulate environmental dynamics conditioned on actions, enabling learning from data-rich simulations rather than the real world.
  • Genie is a data-driven, multimodal world model trained on broad video and image data to generate new environments and support agent control.
  • Genie 3 combines text and image inputs with real-time interaction, delivering higher-fidelity visuals and more consistent worlds.
  • The discussion explores the role of simulation in training safe, reliable robots and the path toward general intelligence through embodied AI.

Introduction to world models and Genie

The episode centers on world models: AI systems that predict the next state of an environment given the actions taken in it. Jack Parker-Holder of Google DeepMind offers a practical definition, noting that world models let agents simulate environmental dynamics, learn from those simulations, and apply what they learn to real-world tasks. This contrasts with language models such as those behind ChatGPT or Gemini, which predict the next token in text. World models extend prediction to multimodal data streams such as video and audio, and incorporate physical intuition learned from interactions with the world.
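
The definition above can be made concrete with a toy sketch. This is not how Genie works internally: the `State` class, the `toy_world_model` function, and the 10-cell track are all hypothetical, and a hand-written rule stands in for a learned model. It only illustrates the interface the episode describes, where a model maps a state and an action to a predicted next state and an agent can roll that model forward without touching the real environment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class State:
    """Hypothetical toy state: an agent's position on a 1-D track."""
    position: int


def toy_world_model(state: State, action: int) -> State:
    """Predict the next state from (state, action).

    Assumption: a hand-written rule stands in for a learned model.
    The track has cells 0..9 and the agent cannot leave it.
    """
    new_position = max(0, min(9, state.position + action))
    return State(new_position)


def rollout(
    model: Callable[[State, int], State],
    start: State,
    actions: List[int],
) -> List[State]:
    """Simulate a sequence of actions entirely inside the model."""
    states = [start]
    for action in actions:
        states.append(model(states[-1], action))
    return states


trajectory = rollout(toy_world_model, State(0), [1, 1, -1, 1])
print([s.position for s in trajectory])  # prints [0, 1, 2, 1, 2]
```

A language model would fill the same role for text, predicting the next token from the tokens so far; here the prediction is conditioned on an action and describes a state of the environment.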

From Minecraft to the real world

Listeners are reminded that while Minecraft and other simulated environments can model some aspects of reality, the gulf between a game and the real world is vast. Subtleties such as facial expressions, water interactions, and wind behavior are difficult to encode by hand. The host and guest discuss how advances in generative AI, particularly in image and video modeling, have begun to close this gap by providing models that can simulate physics and complex environments in a more data-driven way. The overarching bet is that models which learn from real data and generate new worlds will become increasingly capable representations of the real world, ones that AI agents can learn from and operate within.

The human world model and abstraction

The conversation turns to how humans model the world. The host and guest compare human world models with AI ones and discuss levels of abstraction. Humans tend to focus on high-level, functional outcomes rather than every fine detail. AI researchers face analogous questions about fidelity and representation: should world models be highly detailed, or abstract enough to capture only the dynamics essential for planning and interaction?

Genie: origins, evolution and capabilities

Genie began in 2022, when sources of new training environments for reinforcement learning were running short. The project aimed to build a model of environments from data, including learning to extract action labels from video data where none existed. Genie 1, released in 2024, demonstrated the ability to learn from unlabelled video and later handle arbitrary videos with action labels. Genie 2 followed, and Genie 3 introduced substantial improvements in video quality through close collaboration with Veo, a leading video model. Genie 3 now accepts both text and image inputs and, in the latest iteration, can also process video inputs. A key feature is that Genie 3 produces visually consistent worlds, where returning to a location yields similar outcomes, enabling more reliable experimentation and interaction for AI agents and humans alike.

How Genie 3 works in practice

Genie 3 accepts actions as input, which can come from a human user or from an autonomous AI agent. It can generate new environments and control how the agent interacts with them. The model demonstrates emergent behavior such as blending different concepts in sensible ways when prompted with correlated inputs. For example, a video showing people in a city can be paired with a prompt describing a mountain environment, and Genie 3 integrates the two to adapt the scene. In a notable demonstration, Genie responded to a real-world team video by transforming the on-screen world into a jungle with dinosaurs, illustrating the model’s capacity to understand and manipulate context within a simulation.

Applications and safety implications

The World, the Universe and Us conversation emphasizes embodied AI as a practical path toward general intelligence. The interview discusses how simulations can be used to train and evaluate robots, especially for long-tail or rare events that are not captured in training data. Partners like Waymo demonstrate how Genie can be used to simulate difficult driving scenarios, including snow-covered bridges or unexpected obstacles like an elephant, helping to foresee and mitigate potential safety issues before real-world deployment. The interview also touches on the broader societal impacts and safety considerations of increasingly capable embodied AI systems.
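
The idea of checking a system against rare scenarios in simulation, before deployment, can be illustrated with a deliberately simple sketch. Nothing here reflects Waymo's or DeepMind's actual tooling: the `Scenario` class, the scenario names and numbers, and the constant-deceleration braking formula are all assumptions chosen for illustration. In the approach the episode describes, a generative world model would supply the scenarios; here a hand-written physics rule does.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Scenario:
    """Hypothetical test case for a braking check."""
    name: str
    obstacle_distance_m: float
    friction: float  # 1.0 = dry road; lower values are slipperier


def stopping_distance_m(
    speed_mps: float, friction: float, decel_mps2: float = 8.0
) -> float:
    """Stopping distance v^2 / (2 * a), with deceleration scaled by friction.

    Assumption: a constant-deceleration model, far simpler than real
    vehicle dynamics.
    """
    return speed_mps ** 2 / (2 * decel_mps2 * friction)


def evaluate(scenarios: List[Scenario], speed_mps: float) -> Dict[str, bool]:
    """Return, per scenario, whether the vehicle stops before the obstacle."""
    return {
        s.name: stopping_distance_m(speed_mps, s.friction) < s.obstacle_distance_m
        for s in scenarios
    }


# Illustrative values only; the rare cases echo those named in the episode.
SCENARIOS = [
    Scenario("dry_road_obstacle", obstacle_distance_m=40.0, friction=1.0),
    Scenario("snow_covered_bridge", obstacle_distance_m=40.0, friction=0.3),
    Scenario("elephant_on_road", obstacle_distance_m=30.0, friction=1.0),
]

results = evaluate(SCENARIOS, speed_mps=20.0)
for name, stopped_in_time in results.items():
    print(f"{name}: {'ok' if stopped_in_time else 'FAIL'}")
```

With these made-up numbers, only the snow-covered bridge fails, which is the point of long-tail testing: a behavior that looks safe in common conditions can break in a rare one that simulation surfaces cheaply.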

Future directions and philosophical questions

Towards the end, the discussion considers whether world models are strictly necessary for AGI or whether other approaches could reach similar levels of general intelligence. The speakers also address questions around machine understanding, reasoning and interpretability, including the idea of thinking traces in language models and examples of concepts the Genie system demonstrates. They close with reflections on responsible AI development, the importance of robust evaluation with diverse scenarios, and the potential for AI to assist human learning and exploration in ways that go beyond traditional media consumption.

Conclusion

The episode presents a candid view of where world models and simulation-based AI stand, how Genie has progressed through multiple generations, and how these advances could influence robotics, autonomous systems and future artificial general intelligence. It highlights the practical value of training in simulated environments and the ongoing work to make those simulations safe, diverse, and useful for real-world deployment.
