Why the post-LLM bet is on machines that simulate reality, not just text

World Models

World models try to give AI an internal physics — the ability to predict what happens next in the actual world, not just the next token.

LeCun and JEPA

Yann LeCun, Meta's chief AI scientist and a Turing Award winner, has been the loudest critic of pure-LLM scaling. His essays and his 2022 position paper "A Path Towards Autonomous Machine Intelligence," which he posted on OpenReview, laid out the framework: text-only autoregressive models are a dead end for general intelligence because they are trained to imitate human language, not to model the world. His proposed alternative is JEPA, the Joint Embedding Predictive Architecture. Instead of predicting the next pixel or the next token, JEPA predicts a latent representation. You take a video, mask out part of it, and train the model to predict an abstract summary of what's missing, not the literal pixels. This sidesteps a problem that dogs generative video models: predicting exact pixels forces the model to hallucinate detail that doesn't matter. Predicting a compact embedding lets it focus on what does. Meta's I-JEPA (2023) and V-JEPA (2024) papers showed this works on images and video, respectively. They learn representations that transfer well to downstream tasks without ever generating a pixel. LeCun's claim is that this is the right inductive bias for building agents that understand cause and effect in the physical world.
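
To make "predict the embedding, not the pixels" concrete, here is a toy sketch in NumPy. It is not Meta's implementation: the encoders are single linear maps, the data is synthetic, only the predictor is trained, and every name and size (w_context, w_target, w_pred, D_LATENT) is invented for illustration. The point is only that the loss is measured between embeddings, so the model is never asked to reconstruct the hidden input itself.

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_LATENT, N = 16, 4, 256  # hypothetical input, embedding, and batch sizes

# Toy data: the hidden part of each sample is a noisy linear function of the visible part.
visible = rng.normal(size=(N, D_IN))
mixing = rng.normal(scale=0.3, size=(D_IN, D_IN))
masked = visible @ mixing + 0.1 * rng.normal(size=(N, D_IN))

# Linear stand-ins for the three networks (assumptions, not a real architecture).
w_context = rng.normal(scale=0.1, size=(D_IN, D_LATENT))  # context encoder
w_target = w_context.copy()                                # target encoder
w_pred = np.eye(D_LATENT)                                  # predictor


def latent_loss(w_pred):
    """Mean squared error between predicted and actual embeddings of the hidden part."""
    z_context = visible @ w_context   # embedding of what the model can see
    z_target = masked @ w_target      # embedding of what was hidden, treated as a constant
    error = z_context @ w_pred - z_target
    return float(np.mean(error ** 2)), z_context, error


def ema_update(w_target, w_context, momentum=0.99):
    """Let the target encoder slowly track the context encoder."""
    return momentum * w_target + (1.0 - momentum) * w_context


initial_loss, _, _ = latent_loss(w_pred)
for _ in range(200):
    _, z_context, error = latent_loss(w_pred)
    grad = 2.0 * z_context.T @ error / error.size  # gradient of the mean squared error
    w_pred -= 0.5 * grad                           # plain gradient descent on the predictor
final_loss, _, _ = latent_loss(w_pred)

print(f"latent loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

In this sketch the encoders stay frozen, so ema_update is shown but has nothing to track. In a fuller version the context encoder would be trained alongside the predictor, with the target encoder updated by the moving average rather than by gradients.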

Predictive coding — the older idea underneath

JEPA didn't come out of nowhere. It's a modern instantiation of predictive coding, a neuroscience theory from the 1990s associated with Rajesh Rao, Dana Ballard, and later Karl Friston. In this view the brain is constantly predicting its sensory input, and only the prediction errors propagate upward. Most of what your visual cortex does is suppress signals it already expected. This framing, intelligence as compression by prediction, has been quietly load-bearing across AI for thirty years. Hinton's Helmholtz machines, Schmidhuber's papers on curiosity and compression, Friston's free energy principle, and now LeCun's JEPA all rhyme. The shared intuition: a system that can predict its sensorimotor stream has learned the structure of its environment, and that structure is what we call understanding.
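
The "only the errors propagate upward" idea can be sketched in a few lines. This is a deliberately stripped-down, single-layer toy loosely in the spirit of the predictive coding literature, not a reproduction of any published model. The weights are fixed and random, there are no priors or learning, and names such as generative, settle, and D_CAUSES are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
D_INPUT, D_CAUSES = 8, 3  # hypothetical sizes

# Top-down weights: map hidden causes to a predicted sensory input.
generative = rng.normal(size=(D_INPUT, D_CAUSES))
true_causes = rng.normal(size=D_CAUSES)
sensory_input = generative @ true_causes  # an input the model is able to explain fully


def settle(sensory_input, generative, steps=500):
    """Refine an estimate of the hidden causes using only the prediction error."""
    # Step size chosen from the largest singular value so each update is stable.
    rate = 1.0 / np.linalg.norm(generative, 2) ** 2
    estimate = np.zeros(generative.shape[1])
    error_norms = []
    for _ in range(steps):
        prediction = generative @ estimate       # top-down prediction of the input
        error = sensory_input - prediction       # the residual is the only signal sent upward
        estimate += rate * generative.T @ error  # revise the belief to shrink the residual
        error_norms.append(float(np.linalg.norm(error)))
    return estimate, error_norms


estimate, error_norms = settle(sensory_input, generative)
print(f"error norm at start: {error_norms[0]:.4f}, at end: {error_norms[-1]:.6f}")
```

Once the estimate explains the input, the error signal goes quiet. That is the toy analogue of a cortex that mostly suppresses what it already expected.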

Robotics — where world models actually have to work

Talk is cheap. Robots don't get to hallucinate. This is why the most aggressive world-model work is happening at robotics labs. Tesla (with Optimus), Figure AI, and 1X Technologies are all building humanoid robots that need to plan multi-step actions in cluttered physical environments. The dominant approach has shifted from hand-engineered control to learned policies, and the bottleneck is data. You can't pretrain on the entire internet for embodied tasks. So these labs are doing two things: building massive teleoperation pipelines to collect demonstration data, and training world models that let robots plan in imagination instead of in the real world. The pattern: a robot equipped with a good world model can try a thousand candidate actions inside its own head, score them, and execute only the best one. This is essentially what AlphaGo did with Monte Carlo Tree Search, except the model of the world is learned rather than given. Tesla has publicly described training a "world simulator" on Optimus video. Figure's Helix architecture combines a vision-language reasoning module with a fast motor controller, with what looks like implicit world-model structure in between. 1X has shown autonomous home navigation that appears to rely on predictive forward models.
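
The "try a thousand candidates in imagination, execute the best one" pattern is often implemented as random-shooting model-predictive control. The sketch below shows that generic technique under heavy assumptions, and is not a description of any of these labs' systems. The world_model function is a hand-written stand-in for a learned dynamics model, the "real" environment is the same toy function, and GOAL, score, and plan are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)
GOAL = np.array([1.0, 1.0])  # hypothetical target position in a 2D plane


def world_model(state, action):
    """Stand-in for a learned dynamics model: predicts the next state."""
    return state + 0.1 * action


def score(states):
    """Higher is better: negative distance to the goal."""
    return -np.linalg.norm(states - GOAL, axis=-1)


def plan(state, n_candidates=1000, horizon=5):
    """Imagine many random action sequences and return the first action of the best one."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, 2))
    imagined = np.repeat(state[None, :], n_candidates, axis=0)
    total = np.zeros(n_candidates)
    for t in range(horizon):
        imagined = world_model(imagined, candidates[:, t])  # roll forward in imagination
        total += score(imagined)                            # accumulate score along the way
    return candidates[np.argmax(total), 0]


state = np.zeros(2)
for _ in range(50):
    action = plan(state)                # think: evaluate 1000 imagined futures
    state = world_model(state, action)  # act: here "reality" is the same toy model

print(f"final distance to goal: {np.linalg.norm(state - GOAL):.3f}")
```

Only the first action of the winning sequence is executed before replanning, which keeps the robot from committing to a long plan built on an imperfect model. Tree search, as in AlphaGo, replaces the random sampling with a guided search over the same kind of imagined rollouts.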

Genie — DeepMind's generative world model line

DeepMind has taken the world-model bet in a different direction. Genie (2024), Genie 2 (late 2024), and Genie 3 (2025) are generative interactive environments. You give the model an image (a single frame, sometimes a hand-drawn sketch) and it generates a playable world from it. You can press arrow keys, walk around, interact with objects, and the model produces consistent, controllable video, in real time in the latest version. This is striking for two reasons. First, the model has clearly learned implicit physics: objects fall, water flows, doors block movement. Second, it's a foundation model for environments: instead of training agents in hand-built simulators, you can generate the simulator itself from a prompt. Genie 3 in particular pushed coherence over minutes rather than seconds, which is the regime where this stops being a demo and starts being useful for agent training. OpenAI's Sora, Google's Veo, and Runway's Gen-3 are sometimes pitched as world models, but most researchers distinguish them: those are video generators optimized for visual quality, not necessarily for action-conditioned consistency. The world-model crown goes to whoever can let you steer the future of the video and have physics hold up.

Why text-only LLMs hit a ceiling

The provocation underneath all of this: GPT-4, Claude, and Gemini are extraordinary at text and increasingly at images, but they have never been embodied. They know about gravity from billions of words written about gravity. They don't know it the way a toddler who dropped a spoon knows it. This shows up in concrete failures. LLMs are weirdly bad at spatial reasoning, at counting objects in cluttered scenes, at predicting how an unfamiliar tool will behave, and at any task where the right answer requires running a mental simulation rather than retrieving a pattern. They confabulate confidently about physical scenarios because they've never been corrected by reality. The world-model bet is that closing this gap requires training paradigms where the model is grounded: through video at massive scale, through robot data, through interactive environments like Genie, or through some combination of these. Pure next-token prediction on text has, in this view, an asymptote.

The post-LLM bet

Not everyone buys it. The scaling camp argues that LLMs will continue to improve, that multimodal training is folding in vision and video naturally, and that world models are an elaborate name for "video prediction with extra steps." They point to GPT-4's emergent physical reasoning and to the fact that text contains more world structure than skeptics credit. The world-model camp, led by LeCun but increasingly populated by robotics labs and DeepMind, argues that the next leap won't come from bigger text models. It will come from systems that learn the structure of the world the way animals do — through prediction, through action, through being wrong about reality and adjusting. Both bets are being made simultaneously with enormous capital. The next few years will tell us which one was right, or whether — as is usually the case in AI — both turn out to be partially correct and the real architecture is something neither side has named yet.
