
Project Genie, an experimental Artificial Intelligence (AI) model released by Google in January, is an impressive technical feat. Give the tool a prompt, such as an image or a short snippet of text, and it will generate an interactive world for the user to explore.
If you type a simple query, the result is a realistic simulation. If, on the other hand, you start from a painting by Georges Seurat, you can take a Sunday stroll in the park in the artist's pointillist style.
Project Genie may look like a video game, but its creators claim it's something much deeper. They call it a "world model," a crucial tool for helping AI systems understand the complex and unpredictable physical spaces where many of them will eventually be put to work.
The company argues that a future where humanoid robots go to the store to buy ingredients before cooking dinner, or where autonomous cars drive on rural roads, would not be possible without models of the world.
The concept dates back to a 1943 book by Kenneth Craik, a Scottish psychologist, who suggested that organisms carry within their minds a “small-scale model” of the world, on which they test hypotheses before applying them to reality.
Having some understanding of how the world works is a necessary step before making plans to change it. Without one, every living being would be forced to live in a purely reactive manner, recoiling from pain, searching for food, and little else.
Giving AI systems this ability was a promising area of research back in the 1990s, before large language models (LLMs) captured the world's attention. Now, that attention has returned.
There are three main approaches being explored for building models of the world. A natural starting point is AI video generators. Generating coherent video depends on simulating a coherent world: if the laws of reality changed from one frame to the next, the result would be meaningless.
Such rudimentary models of the world can fill in details beyond what they are given: give them a picture of a maze and they can draw a path through it; show them a picture of hands holding a jar and they will accurately model the movements needed to open it.
Project Genie is the culmination of this approach. Its usefulness becomes clear when you imagine it paired with another AI, such as a robot working in a shop, that is trying to learn how to act in the physical world.
The billions of hours of training data required for such a task would be much harder to collect from the real world than from a model that can simulate the environment. And if the simulations are accurate enough, the system can use this data to train itself.
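To make the idea concrete, here is a minimal sketch in Python of what collecting experience from a simulator, rather than from the real world, might look like. The WorldModel class and the random policy below are hypothetical stand-ins, not Project Genie's actual interface; the point is only that, once the dynamics live in software, extra training data costs little more than compute.

    import random

    class WorldModel:
        """Toy stand-in for a learned simulator: imagines how a state responds to an action."""
        def reset(self):
            return random.uniform(-1.0, 1.0)                     # an imagined starting situation
        def step(self, state, action):
            return state + action + random.uniform(-0.05, 0.05)  # imagined dynamics, not real physics

    def collect_experience(world, policy, episodes=1000, horizon=50):
        """Roll the policy out inside the simulator, recording (state, action, next_state) triples."""
        dataset = []
        for _ in range(episodes):                                # in simulation, extra episodes are cheap
            state = world.reset()
            for _ in range(horizon):
                action = policy(state)
                next_state = world.step(state, action)
                dataset.append((state, action, next_state))
                state = next_state
        return dataset

    random_policy = lambda state: random.uniform(-0.2, 0.2)
    data = collect_experience(WorldModel(), random_policy)
    print(len(data), "transitions gathered without touching the real world")

In a real pipeline, the agent would then be trained on this imagined experience and checked, from time to time, against the real environment.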
However, even the most realistic video in the world can't capture every detail that a human would notice. For example, the broken refrigerator in the back of the store that's causing the fresh fish to rot isn't captured by the camera, nor is the smell that accompanies it.
Objects that are not directly visible also lie beyond its reach. If the model generates the contents of one aisle in a store, for example, the adjacent aisles do not exist until the user walks into them. That makes it harder to simulate complex environments, or to let multiple users move around in the same model.
Three-dimensional environment
Another approach to building models of the world aims, therefore, to create full three-dimensional environments rather than two-dimensional simulations. Fei-Fei Li, a computer scientist at Stanford University, is leading an approach she calls spatial intelligence.
According to her, models of the world must be interactive, multimodal (able to interpret different kinds of input) and persistent.
Video-based systems can overcome the first two hurdles, but struggle with the third. Project Genie, for example, runs for a maximum of 60 seconds before its simulations start to break down.
Dr Li's startup, World Labs, has built a world model called Marble, which can create digital versions of three-dimensional worlds that are internally consistent and complete.
This means it is possible, for example, for several users to be inside the same world at the same time. And spaces are not generated anew every time the user looks around; they exist in their entirety from the start.
World Labs is offering its product to architects, who can use it to imagine a space and explore it virtually before sending it to a 3D printer.
Yann LeCun, formerly Meta's chief AI scientist, thinks that models of the world can be built in a different, less direct way. For him, focusing on real spaces is a distraction.
Ultimately, many AIs will have to navigate virtual labyrinths like human resources systems or legal documents, not just physical spaces like stores. He believes that equipping AI with the tools to consistently model both types of environments is an important step toward making it useful.
An AI, he says, could use a large language model to interact with such a model of the world, helping it perform tasks whether in the real world or on a computer.
Models
This approach, called Joint-Embedding Predictive Architecture (JEPA), would allow an AI to simulate complex features of the real world. Existing models of the world focus on what will happen immediately, rather than events that may (or may not) occur in the more distant future.
People think ahead all the time: assessing the weather before deciding whether to leave the house with an umbrella; considering the risk of being late for an important meeting when choosing which train to catch; and so on.
The important thing is that these decisions can be made quickly, without having to visualize every second of the day. Current models of the world do not have such a prediction mechanism.
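One way to picture the alternative Dr LeCun has in mind: rather than rendering every pixel of the future, a JEPA-style system is trained to predict an abstract representation of it. The sketch below is only illustrative, with invented module names and sizes and PyTorch used as a convenient framework; it shows the shape of the objective, in which a predictor must match the embedding of what comes next, not the thing itself.

    import torch
    import torch.nn as nn

    class TinyJEPA(nn.Module):
        """Illustrative joint-embedding predictive setup; dimensions are arbitrary."""
        def __init__(self, obs_dim=64, latent_dim=32):
            super().__init__()
            # Encodes what the system has already observed.
            self.context_encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            # Encodes the future (or masked) observation it should anticipate.
            self.target_encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            # Predicts the target's embedding from the context's embedding.
            self.predictor = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))

        def loss(self, context_obs, target_obs):
            z_context = self.context_encoder(context_obs)
            with torch.no_grad():                     # the target branch is held fixed here
                z_target = self.target_encoder(target_obs)
            z_pred = self.predictor(z_context)
            # The error lives in representation space: the model never has to render
            # the future, only to anticipate its gist.
            return ((z_pred - z_target) ** 2).mean()

    model = TinyJEPA()
    context = torch.randn(8, 64)                      # a batch of "what has been seen so far"
    future = torch.randn(8, 64)                       # a batch of "what happens next"
    print(model.loss(context, future).item())

In practice the target encoder is usually a slowly updated copy of the context encoder, a detail omitted here for brevity.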
Dr LeCun has been exploring the potential of a JEPA system since 2022, and in November 2025 he left Meta to tackle the problem full-time. His startup, Advanced Machine Intelligence, plans to turn his ideas into reality, starting with a partnership with Nabla, a health technology startup. He says the goal is a system that uses its own model of the world to determine “what sequence of actions will optimally accomplish a task that I give it.”
But what if these complicated approaches are overkill? If existing generative AI systems can already do useful things in the real world, perhaps they already contain some kind of model of the world within them.
That’s the view of Ilya Sutskever, co-founder of OpenAI, and many of his former colleagues who still work at the lab. Training a large language model, he said in 2023, is nothing more than “learning a model of the world.”
Compressing all the information contained on the internet into a few hundred gigabytes of numbers is only possible if a system "learns" the fundamental principles behind that information.
A fantastic new perspective
There are some indications that he may be right. In 2023, a language model trained on a list of moves in the game Othello was shown to reflect the state of the board within its neural network, even though it had never seen an Othello board and had not been taught the rules of the game.
The representation was detailed enough that the researchers could identify the specific parts of the neural network that kept track of the color of individual pieces. That meant they could make targeted interventions to change the model's perception of the game, an unprecedented level of control over the calculations of a large language model.
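The technique behind that finding is usually called probing: a small classifier is trained to read some property of the world, here the state of a single board square, directly out of the network's hidden activations. The sketch below is a toy version in Python, with random numbers standing in for the real model's activations, so its accuracy will hover around chance; in the actual experiment, it was the probes' high accuracy that showed the board was represented inside the network.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_positions, hidden_dim = 2000, 128

    # Stand-ins for hidden states taken from one layer of the model, one vector per game position.
    activations = rng.normal(size=(n_positions, hidden_dim))
    # Stand-in label for a single board square: 0 = empty, 1 = black, 2 = white.
    # (In the real experiment, labels come from replaying the recorded move sequences.)
    square_state = rng.integers(0, 3, size=n_positions)

    # Fit a probe; if it predicts held-out positions well, information about that square
    # is readable from the activations, even though the model was never shown a board.
    probe = LogisticRegression(max_iter=1000)
    probe.fit(activations[:1500], square_state[:1500])
    print("held-out probe accuracy:", probe.score(activations[1500:], square_state[1500:]))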
Larger language models are likely to contain even more complex models of the world, if only researchers could find them. Anthropic, an AI lab, has led research on “interpreting” its Claude models, finding groups of artificial neurons that correspond to everything from feelings of guilt to the Golden Gate Bridge.
And interfering with these structures, as in the example of Othello, causes corresponding changes in the subsequent behavior of these models.
This suggests that the systems are not simply stringing together words: they have a consistent understanding of the physical features of the real world, which they use to answer questions. This sounds a lot like what one would expect from an internal model of the world.
Not everyone agrees. Large language models, argues Dr Li, are simply “wordsmiths in the dark.” The ability to use language to describe the world, she says, does not mean their words are grounded in its reality.
Like a student who has only read about a foreign country, such models are missing knowledge that books alone cannot supply, she says. Whatever approach proves most effective, one thing is clear: Artificial Intelligence is ready to visit the real world. /The Economist/