World Models: A New Way to Simulate the World


What happens when a Video Model can simulate reality perfectly?

World Models
AI
Video Generation
Vision

A New Kind of Medium

Imagine if you could step inside a dream. Not just watch it, but move around, change things, and see the world react in real time. Dreams feel loose and strange, yet they still follow rules like gravity, motion, cause and effect. They feel alive.

Now imagine that instead of your brain generating that dream, an AI model is doing it. A model that has learned how the real world behaves well enough to simulate one. A model you can talk to, control, and shape.

A screenshot from the movie Ready Player One, which most people mention when thinking of this future of world models.

This is the idea of a World Model.

For many people, the idea sounds abstract at first. It did for me too. But once you understand what a World Model can actually do, the feeling is electric. The concept sticks in your mind the same way the idea of chat assistants did when large language models first appeared. Something that looks simple at first glance turns into a new medium that changes how humans interact with knowledge, creativity, and software.

World Models are the next leap. A completely new modality of intelligence. And by the end of this post, I want you not only to understand them but to feel why they matter.



Dreams, Games, and the Bridge Between Them

To understand where World Models come from, we need to look at two different paths that have been developing in parallel:

  1. Video Models that learn reality by observing it
    They learn how reality looks, mimicking textures, lighting, and motion patterns, without necessarily understanding the causal rules of how one moment leads to the next. They predict what a video should look like, not how a world works.

A video generated with a video model as described above.

The logo of Wan, a video model that behaves exactly as described.

  2. Models that learn games by acting inside them
    This path is historically motivated by robotics, focusing on learning environment dynamics rather than just visuals. Think of algorithms playing Chess or Go: they don't need to predict pixels, but they understand the rules, the actions, and how the state of the game changes. These models learn how to behave inside a closed, controllable system.

DeepMind's model, AlphaGo

In games like Chess and Go, the observations represent the entire state of the game. Even if you only had a video of someone playing chess, it would be simple to extract the exact state (position of the pieces) with computer vision. But that is not the case in all games.

In partially observable environments like Doom, or probabilistic ones like Minecraft, you can't infer the state from one observation (one frame). And that's where you need intelligent systems that understand and build an internal representation of the world.

The logo of Wanda, a video model that behaves exactly as described.

When comparing Video Models and Game Models, these paths seem unrelated. One learns how reality moves. The other learns how actions shape an environment. But if you follow these lines far enough, they start bending toward each other. And where they meet, the idea of a true World Model begins to form.

Let's walk the path.


How Models Learn to See the World

From Images to Moving Reality

Image diffusion models were the first real breakthrough in visual generation. They begin with pure noise and slowly transform it into a clear picture based on a text prompt.

Image diffusion

That was the first spark: a model that can imagine.
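To make the mechanism concrete, here is a deliberately toy sketch of that denoising loop in PyTorch. The `denoiser`, the fixed step count, and the update rule are stand-ins I made up for illustration; a real system uses a trained U-Net or transformer, a text encoder, and a proper noise scheduler.

```python
import torch

# Toy stand-ins: a real model would use a trained denoising network and a
# text encoder. Here the "denoiser" is a single conv so the loop runs end to end.
denoiser = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
text_embedding = torch.randn(1, 77, 768)  # an encoded prompt (unused by the toy net)

num_steps = 50
image = torch.randn(1, 3, 64, 64)  # generation starts from pure noise

for t in reversed(range(num_steps)):
    # Predict the noise present at step t and remove a fraction of it.
    predicted_noise = denoiser(image)
    image = image - predicted_noise / num_steps  # schematic update, not a real scheduler

# After the final step, `image` stands in for the generated picture.
```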

But pictures are frozen moments. They do not have movement or memory. So researchers extended the idea by adding a new dimension: time. Instead of a single image, the model now generates a whole block of frames. This is video diffusion.

Bidirectional video diffusion
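The structural change from the image case is small: the tensor being denoised simply gains a time axis, and every frame is denoised together with every other frame. Again a toy sketch, with a made-up `video_denoiser` standing in for a real spatio-temporal network.

```python
import torch

# Same idea as image diffusion, but the tensor now carries a time axis:
# (batch, channels, frames, height, width).
video_denoiser = torch.nn.Conv3d(3, 3, kernel_size=3, padding=1)

num_frames = 16  # the clip length is fixed before generation starts
video = torch.randn(1, 3, num_frames, 64, 64)  # the whole clip starts as noise

for t in reversed(range(50)):
    predicted_noise = video_denoiser(video)  # every frame is refined alongside every other
    video = video - predicted_noise / 50     # schematic update

# Only now, after all steps over the full block of frames, is anything viewable.
```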

And something surprising happened.

Because these models trained on huge collections of real videos, they started learning the rules of the world. Not words in a textbook, but patterns of motion and physics:

  • The way waves fold
  • The way cars crash
  • The way fire flickers
  • The way people walk

They learned reality through observation.


But these models also had two major limits:

  • The video length had to be decided before generation (typically around 10 seconds)
  • The whole video had to be created at once before anything could be seen

In academia, this is usually called full-sequence diffusion or non-causal generation (sometimes referred to as bidirectional generation). Every frame depends on both the past and the future, so the model cannot generate the world step by step.
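One way to picture the difference is through the attention mask over frames: full-sequence generation lets every frame look at every other frame, past and future alike, while causal generation only lets a frame look backwards. The tiny sketch below shows just that contrast, nothing model-specific.

```python
import torch

num_frames = 6

# Full-sequence ("bidirectional") diffusion: every frame may attend to
# every other frame, so nothing can be produced until the whole clip is done.
bidirectional_mask = torch.ones(num_frames, num_frames)

# Causal generation: frame t may only look at frames <= t, which is what
# makes step-by-step, interactive generation possible later in this post.
causal_mask = torch.tril(torch.ones(num_frames, num_frames))

print(bidirectional_mask)
print(causal_mask)
```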


That means no interaction. No control. No real time. No world you can enter.

These models learned reality, but they could not let you live inside it.


How Models Learn to Act in Games

Game-Trained World Models

While video models were learning physics from observation, another group of researchers was working from a different direction.

Instead of training on videos of the real world, they trained on games.

Projects like DIAMOND, Dreamer and GameNGen fed models huge amounts of gameplay data: what the player saw, what the player did, and how the world reacted. These models learned:

  • how to predict the next frame
  • how the environment reacts to actions
  • how to simulate the entire game internally

Imagination in the DIAMOND paper.

In other words, they learned to be action-conditioned generators (where a policy can serve as a separate control module).
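To make "action-conditioned generator" concrete, here is a toy training step for a next-frame predictor conditioned on a discrete action. The architecture and names are mine for illustration only; the real DIAMOND, Dreamer, and GameNGen models are far more sophisticated.

```python
import torch
import torch.nn as nn

NUM_ACTIONS = 9  # e.g. 8 directions + fire, matching a small game action set

class NextFramePredictor(nn.Module):
    """Toy model: predict the next frame from the current frame and an action."""
    def __init__(self):
        super().__init__()
        self.action_embed = nn.Embedding(NUM_ACTIONS, 16)
        self.net = nn.Conv2d(3 + 16, 3, kernel_size=3, padding=1)

    def forward(self, frame, action):
        # Broadcast the action embedding over the image and concatenate it as extra channels.
        a = self.action_embed(action)[:, :, None, None].expand(-1, -1, 64, 64)
        return self.net(torch.cat([frame, a], dim=1))

model = NextFramePredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on (frame, action, next_frame) triples taken from gameplay logs.
frame = torch.randn(8, 3, 64, 64)
action = torch.randint(0, NUM_ACTIONS, (8,))
next_frame = torch.randn(8, 3, 64, 64)

loss = nn.functional.mse_loss(model(frame, action), next_frame)
loss.backward()
optimizer.step()
```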

But these models had the opposite problem of video models:

  • They could act, but their worlds were limited (trained only on Atari-style games)
  • They understood only the actions inside their dataset (⬅️➡️↖️⬆️↗️↙️⬇️↘️🔥)
  • They could not generalize beyond the game's rules (they cannot create a new game)

Where video models understood reality but could not interact, game models could interact but did not understand reality.

These two lines of research were moving toward the same destination but from opposite ends.


Where These Two Paths Meet

The missing piece, the thing that finally connects the "dream world" of videos with the "interactive world" of games, is this:

Autoregressive Video Models

Researchers discovered how to turn a slow, bidirectional video diffusion model into a fast, frame-by-frame generator (sketched just after the list below). This changes everything:

  • The model can react to actions
  • The video can continue indefinitely
  • Prompts can be changed during generation
  • The world becomes controllable
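
Here is that sketch of an autoregressive rollout. `generate_next_frame` is a hypothetical placeholder rather than any real system's API, but the loop structure is the point: each step consumes the current action and prompt, so both can change mid-stream, the rollout has no fixed end, and every frame can be shown as soon as it exists.

```python
import torch

def generate_next_frame(past_frames, action, prompt):
    # Hypothetical stand-in for a learned, causal frame-by-frame generator.
    return torch.randn(3, 64, 64)

frames = [torch.zeros(3, 64, 64)]  # some starting frame
prompt = "a sunny street"

for step in range(1000):           # no length decided up front; loop as long as you like
    action = 0                     # in a real system: read the player's input here
    if step == 300:
        prompt = "the same street during a storm"  # prompt changed mid-rollout
    frames.append(generate_next_frame(frames, action, prompt))
    # each new frame can be displayed immediately, which is what makes it interactive
```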
💡

Odyssey's latest product is built entirely on this technology, on the premise of promptable, real-time video generation.

This is the moment where video models learn to act, and game models learn to generalize.

The two paths merge into one.


Why Prompting Matters

One of the most exciting parts of World Models is the ability to change the world with language. Imagine saying:

  • turn the weather into a storm
  • make the floor ice
  • spawn a tree behind me
  • switch to night

This is the "thought" layer of the dream. But today, prompt control in autoregressive video systems is still weak. I have spent months experimenting with prompting for my own projects, and the limitations are clear.


We are getting close, but not there yet.


A Thought Experiment: Inside a Dream

To understand the future of World Models, try this exercise.

Close your eyes and picture the room you are in.

Now imagine you can fly.
Imagine the walls melt into water.
Imagine the ground becomes grass.
Imagine you teleport to a beach with a single thought.

💡

Video generated with Runway's Gen-4.5 model. (Bidirectional Video Model)

That is what a true World Model will feel like. A space that reacts to intention. A place made of learned physics, not fixed geometry. A world you can shape.

The dream metaphor is not an accident. Early diffusion videos genuinely look like dreams. Shifting forms. Soft edges. Strange transitions. They are attempts at reality made from memory, not actual physics.

World Models will turn those dreams into places you can inhabit.

Dream screenshot

Dream screenshot

Dream screenshot

💡

I generated these dream-like images back in 2021, while experimenting with a model trained on Minecraft screenshots.


The Hardest Problem: Memory

To feel real, worlds need persistence.

If you draw on a wall in a World Model, leave the room, and return later, the drawing should still be there. This sounds simple to humans but is extremely difficult for models.

A tempting idea is to store everything as a 3D mesh. But this breaks the magic:

  • a mesh cannot be freely changed through prompts
  • converting between mesh and model space destroys flexibility
  • it forces human interpretation onto a system that learns in its own language

Minecraft World Models show the issue clearly. Without proper memory, inventory items change randomly. You can fix it by hard-coding inventory logic, but then you are locking the model into narrow, specific solutions. That approach will never scale.


The only path forward is implicit memory: memory stored inside the model's latent space, not in human-designed structures.
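
A toy way to picture implicit memory: the model carries a latent state forward from frame to frame, and anything that should persist, like that drawing on the wall, has to be encoded into this state. The sketch below uses a simple recurrent cell as the memory; it illustrates the idea, not how Genie 3 or any other model actually works.

```python
import torch
import torch.nn as nn

# Toy "implicit memory": a latent state carried across frames. Whatever the
# world should remember must live inside `memory`, not in a hand-built 3D scene.
encoder = nn.Linear(3 * 64 * 64, 256)
memory_cell = nn.GRUCell(256, 512)   # the latent memory
decoder = nn.Linear(512, 3 * 64 * 64)

memory = torch.zeros(1, 512)
frame = torch.zeros(1, 3, 64, 64)

for step in range(100):
    z = encoder(frame.flatten(1))
    memory = memory_cell(z, memory)                 # the state persists across steps
    frame = decoder(memory).view(1, 3, 64, 64)      # the next frame is read out of memory
```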

DeepMind's Genie 3 is leading this direction, with about one minute of memory. That may sound small, but compared to the few seconds of most open models, it is enormous.


Implicit memory is the bridge that lets a World Model stay stable across time.


The Ultimate World Model

When all the pieces come together, we get a model that is:

  • Controllable through actions
  • Promptable through language
  • Real-time, so it reacts frame by frame
  • Consistent thanks to implicit memory
  • Endless with no fixed length
  • Realistic thanks to video training

It becomes a world you can inhabit.


Why World Models Matter Far Beyond Entertainment

World Models are not just a leap for games, they are a breakthrough for robotics. Training real robots is slow and expensive because every mistake risks damage and every new environment needs careful setup. With a World Model, a robot's policy can be trained inside a learned simulation. You do not need to hand-build physics or objects. You can simply show the model images of the space, and it already understands how things behave. If it sees a plate, it knows it can be picked up, dropped, or broken because it learned these patterns from the real world.
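In code, this idea looks roughly like Dreamer-style "imagination training": roll the learned world model forward from a state, let the policy act inside that imagined rollout, and optimize the policy on the predicted outcomes. Everything below is a toy stand-in rather than any company's actual pipeline.

```python
import torch
import torch.nn as nn

# Toy setup: a learned world model that steps a latent state forward given an
# action, a policy that picks actions, and a head that predicts reward.
world_model = nn.GRUCell(8, 64)   # next latent state from (action, state)
policy = nn.Linear(64, 8)         # state -> action
reward_head = nn.Linear(64, 1)    # state -> predicted reward

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

state = torch.zeros(1, 64)
total_reward = torch.zeros(1, 1)

for step in range(15):                       # an imagined rollout, no real robot involved
    action = torch.tanh(policy(state))
    state = world_model(action, state)       # the learned simulator steps forward
    total_reward = total_reward + reward_head(state)

loss = -total_reward.mean()                  # train the policy to maximize imagined reward
loss.backward()
optimizer.step()
```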

To make these models reliable, they must learn from the full range of possible situations, not just the ideal ones. When I trained a Mario Kart model, perfect driving videos taught it almost nothing. It had to see failure to understand the world. More than half the dataset came from crashing into walls, spinning out, or driving off track.

Humans learn the same way: by bumping, testing, and failing.

You don't understand that fire hurts you until you actually get burned.

💡

This video shows Tesla's World Model in action. In the bottom-right corner, you can see inconsistencies such as the truck's front grille changing between frames.

Robotics needs to experience the full tail of the distribution, but collecting it in reality would destroy hardware and cost a fortune. Video-based World Models solve this by learning rare events directly from large datasets. A model that has seen thousands of examples of a car crash can simulate one without crashing an actual car.

This is the direction companies like Tesla are moving toward: training their systems inside a learned world before transferring the resulting policies to the real one.


My Inspiration for the Future

A year and a half ago I trained a tiny Mario Kart World Model. I had no idea what I was touching. But that experiment led me to a community, to researchers, to San Francisco, to friendships and to a technology that I believe will define an era.

Every time I explain World Models to someone, they light up. They start imagining new applications within seconds. They see worlds that could exist and experiences we have never had before. That reaction is why I am writing this.

So try this. Close your eyes. Imagine something from your past. A childhood home. A place you visited. Now imagine walking through it again. Imagine editing reality with a sentence. Imagine traveling anywhere instantly. Imagine entering worlds that never existed but feel as real as the ground under your feet.


This is my vision for World Models. Today it is still a dream. But very soon, dreams will become something you can step inside.
