Beyond LLMs: JEPA, search, and the next shape of AI

Published 2026-05-12 · Updated 2026-05-12 · v1 · #ai #llms #jepa #deepmind #reinforcement-learning #world-models #ai-architecture #deep-learning

Large language models have become the default mental picture of AI.

Ask most people what frontier AI means and they imagine a chat box: tokens flowing left to right, a transformer predicting the next word, a system that can write, summarize, code, role-play, and reason just well enough to feel uncanny.

That picture is not wrong: LLMs are one of the great engineering breakthroughs of the last decade. But the picture is incomplete.

If the goal is useful intelligence — systems that understand the physical world, plan over time, discover new knowledge, act safely, and improve through experience — language modeling is only one path through a larger design space.

Two families of ideas are especially useful for thinking beyond LLMs: JEPA, associated with Yann LeCun and Meta, and the search-plus-reinforcement-learning lineage associated with Demis Hassabis and DeepMind.

They start from different instincts. Both challenge the idea that intelligence is just next-token prediction at scale.

the LLM baseline

An LLM learns by predicting tokens. This is shockingly powerful because text is not just text. Text contains facts, procedures, arguments, code, stories, plans, equations, and compressed records of civilization.
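To make "predicting tokens" concrete, here is a minimal sketch using a bigram count table in place of a transformer. The corpus, the function names, and the count-based model are all assumptions made for illustration; a real LLM learns the same kind of next-token distribution with a neural network over long contexts, not a lookup table.

```python
import math
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_probs(counts, prev):
    """Turn raw follow-counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def avg_next_token_loss(counts, tokens):
    """Average negative log-likelihood of each observed next token."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        p = next_token_probs(counts, prev).get(nxt, 0.0)
        losses.append(-math.log(p) if p > 0 else float("inf"))
    return sum(losses) / len(losses)

# Hypothetical toy corpus, split on whitespace.
corpus = "objects fall when unsupported . cups fall when dropped .".split()
model = train_bigram(corpus)
probs = next_token_probs(model, "when")      # two equally likely continuations
loss = avg_next_token_loss(model, corpus)    # the quantity training would minimize
```

The loss is low wherever the text is predictable ("fall" is always followed by "when") and higher wherever the corpus genuinely branches. Training an LLM pushes that same quantity down across vastly more text.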

But language is also a bottleneck.

A child does not learn gravity by reading “objects fall when unsupported.” She learns by watching cups drop, blocks stack, liquids spill, dogs move, adults reach, doors swing, and bodies collide. Human intelligence is grounded in perception and action before it is verbalized.

LLMs can imitate some of this because humans wrote about it. But they do not naturally learn from raw experience in the way animals do. They do not maintain grounded world models by default. They do not test actions in the physical world unless we connect them to tools, simulators, or robots.

That is where JEPA becomes interesting.

what JEPA is

JEPA stands for Joint-Embedding Predictive Architecture.

The basic idea is simple: instead of predicting raw data, predict an abstract representation of the missing or future part of the data.

Imagine a model looking at a video where part of the frame is hidden. A pixel-prediction model tries to reconstruct every missing pixel: wallpaper texture, shadow noise, irrelevant camera details. JEPA says: do not waste capacity predicting every low-level detail. Predict the meaningful representation.

What object is there? What state is it in? What kind of motion makes sense? What latent structure explains the scene?

I-JEPA applied the idea to images. V-JEPA extended it to video. The target is not pixel-perfect reconstruction. The target is a latent embedding — a compressed state that should preserve what matters.
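The difference between the two targets can be sketched with a toy one-dimensional "image". This is not the I-JEPA or V-JEPA training code: the block-averaging `encode` function is a hand-written stand-in for an encoder that the real models learn, and the signal values are made up.

```python
def encode(signal, block=4):
    """Stand-in encoder (an assumption): average each block into one latent value."""
    return [sum(signal[i:i + block]) / block for i in range(0, len(signal), block)]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hidden region: a step (the "object") plus alternating fine texture (the "wallpaper").
structure = [0.0] * 4 + [1.0] * 4
texture = [0.2 if i % 2 == 0 else -0.2 for i in range(8)]
target = [s + t for s, t in zip(structure, texture)]

# A prediction that gets the structure right and ignores the texture.
prediction = structure

pixel_loss = mse(prediction, target)                    # penalizes the missing texture
latent_loss = mse(encode(prediction), encode(target))   # compares abstract states only
```

In sample space the prediction is still penalized for every bit of texture it failed to reproduce. In the latent space the texture averages out and the loss is essentially zero, because the prediction already captures what the encoder keeps.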

This is closer to how humans often understand scenes. If a person walks behind a wall, you do not model the exact photons behind the wall. You maintain a belief: a person is behind the wall, still moving, likely to reappear.

JEPA is prediction without hallucinating the wallpaper.

where it came from

JEPA sits inside Yann LeCun’s broader argument about autonomous machine intelligence. In his 2022 position paper “A Path Towards Autonomous Machine Intelligence,” LeCun argues that future AI needs world models, self-supervised predictive learning, memory, planning, and objectives that operate beyond pure autoregressive token prediction.

Meta’s I-JEPA and V-JEPA are concrete steps in that direction. They try to learn useful visual and video representations from observation, without requiring labels and without reconstructing every pixel.

The intuition is powerful: an intelligent system should learn the structure of the world by predicting in abstraction space.

JEPA versus LLMs

The contrast is clean, though not absolute.

LLMs learn primarily from symbolic sequences. The published JEPA models learn from perceptual data such as images and video.

LLMs predict tokens. JEPA predicts latent representations.

LLMs are generative by default. JEPA is representation-first.

LLMs learn from human-produced artifacts. JEPA tries to learn more directly from the world’s structure.

The upside is obvious. A strong JEPA-like world model could learn from vast quantities of unlabeled video. It could develop intuitive physics, object permanence, causality, affordances, and temporal abstraction. Because it predicts in representation space, it may avoid wasting effort on irrelevant detail.

But representation learning is not the same as full intelligence.

A JEPA system may learn excellent embeddings and still need memory, goals, planning, uncertainty estimation, action, and self-correction. Passive video may not reveal enough causality. To understand interventions — push, pull, pour, cut, rotate — an agent may need action-conditioned experience.

Evaluation is also hard. LLMs are easy to probe through language. World models are harder. If a video model “understands” physics, what proves it? Robotics transfer? Counterfactual prediction? Long-horizon planning?

So JEPA is not a replacement for LLMs. It is a candidate component for a more grounded architecture.

the DeepMind route: intuition plus search

Demis Hassabis’s career points to another route beyond pure LLMs: combine neural networks with search, reinforcement learning, and simulation.

AlphaGo is the canonical example. It did not win Go by chatting about Go. It combined deep neural networks with Monte Carlo Tree Search. A policy network supplied intuition about promising moves. A value network evaluated board positions. Search explored possible futures, guided by the networks.

That pattern is elegant:

neural intuition narrows the search space
planning tests possible futures
feedback improves the intuition
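The three steps above can be sketched on a toy game. Everything here is an assumption chosen for brevity: the game is a small Nim variant, a table of move weights stands in for the neural network, and plain negamax stands in for Monte Carlo Tree Search. It is not how AlphaGo is implemented, only the same loop in miniature.

```python
MOVES = (1, 2, 3)  # toy Nim: take 1-3 stones; whoever takes the last stone wins

def legal(n):
    """Moves available with n stones left."""
    return [m for m in MOVES if m <= n]

def search(n, prior, width):
    """Negamax that expands only the `width` highest-prior moves per position.
    Returns (value, move); value is +1 if the player to move wins along the
    lines examined, else -1."""
    if n == 0:
        return -1, None  # the previous player took the last stone
    ranked = sorted(legal(n), key=lambda m: -prior[n][m])[:width]
    best_value, best_move = -2, None
    for m in ranked:
        value = -search(n - m, prior, width)[0]
        if value > best_value:
            best_value, best_move = value, m
    return best_value, best_move

def train(max_stones=12):
    """Start from a uniform prior and reinforce whatever full search prefers."""
    prior = {n: {m: 1.0 for m in legal(n)} for n in range(1, max_stones + 1)}
    for n in range(1, max_stones + 1):
        _, move = search(n, prior, width=3)  # planning: test every future
        prior[n][move] += 1.0                # feedback: sharpen the intuition
    return prior

prior = train()
# After training, "intuition only" (width=1) follows a single line of play.
value, move = search(10, prior, width=1)
```

Once the prior has absorbed the search results, a search of width 1 finds the same winning move that the full search did, at a fraction of the cost. That is the sense in which intuition narrows the search space.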

AlphaGo Zero sharpened it. Instead of learning from human games, it learned through self-play. AlphaZero generalized the recipe to chess and shogi. MuZero went further and learned its own model for planning, without being given the rules of the environment.

This is AI as optimization through simulated experience, not AI as autocomplete.

search as an intelligence primitive

Search sounds old-fashioned until you notice how often intelligence means choosing among possible futures.

In a game, the future is a tree of moves. In protein folding, it is a landscape of structural possibilities. In code optimization, it is a space of candidate programs. In robotics, it is a set of action trajectories. In scientific discovery, it is a hypothesis space.

Search is the act of exploring that space intelligently.

The DeepMind recipe is not brute force. Pure brute force is too expensive. The key is learned guidance. Neural networks provide priors: this move looks promising, this structure is plausible, this action may work. Search then refines those priors by looking ahead.
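One common way to combine a learned prior with lookahead statistics is a PUCT-style selection rule of the kind used in AlphaZero-style tree search. The sketch below shows only that selection step. The function name, the constant `c_puct=1.5`, and the example numbers are assumptions for illustration.

```python
import math

def puct_select(priors, visit_counts, mean_values, c_puct=1.5):
    """Pick the action maximizing Q(a) + c_puct * P(a) * sqrt(N_total) / (1 + N(a)).

    priors:       P(a), the network's prior probability for each action
    visit_counts: N(a), how often search has tried each action so far
    mean_values:  Q(a), the average value search has observed for each action
    """
    total_visits = sum(visit_counts.values())

    def score(action):
        exploration = (c_puct * priors[action] * math.sqrt(total_visits)
                       / (1 + visit_counts[action]))
        return mean_values[action] + exploration

    return max(priors, key=score)

# Hypothetical statistics at one node of the search tree.
choice = puct_select(
    priors={"a": 0.6, "b": 0.3, "c": 0.1},
    visit_counts={"a": 10, "b": 1, "c": 0},
    mean_values={"a": 0.2, "b": 0.1, "c": 0.0},
)
```

With these numbers the rule picks "b": action "a" has the best prior and the best observed value, but it has already been visited ten times, so the exploration bonus shifts attention to a plausible move that search has barely examined.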

Humans do something similar. A chess master does not examine every legal move. She sees promising moves quickly, then calculates. A scientist does not test every molecule. She uses theory and intuition to propose candidates, then experiments.

This “intuition plus planning” pattern remains one of the strongest alternatives to pure language modeling.

AlphaFold and search as optimization

AlphaFold is not the same algorithm as AlphaGo, but it belongs to the same worldview: represent a huge structured possibility space and optimize through it.

Protein folding is hard because the number of possible configurations is enormous. AlphaFold2 used deep learning, evolutionary information, geometry, attention, and iterative refinement to predict protein structures with remarkable accuracy. AlphaFold3 extended the ambition toward biomolecular complexes and interactions.

The lesson is not that every problem needs Monte Carlo Tree Search. The lesson is that many hard problems can be framed as search or optimization over structured spaces.

Sometimes the search is explicit, as in game trees. Sometimes it is embedded in iterative refinement. Sometimes it is amortized into a network. But the pattern is similar: represent possibilities, score them, refine toward better solutions.
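The "represent, score, refine" pattern can be sketched as a generic greedy refinement loop. This is not AlphaFold's algorithm. The integer search space, the scoring function, and its peak at (3, -2, 5) are all hypothetical, and greedy local search like this can get stuck on harder landscapes.

```python
def refine(candidate, score, neighbors, max_steps=1000):
    """Greedy refinement: move to the best-scoring neighbor until none improves."""
    current, current_score = candidate, score(candidate)
    for _ in range(max_steps):
        best = max(neighbors(current), key=score)
        if score(best) <= current_score:
            break  # no neighbor is better: a local optimum
        current, current_score = best, score(best)
    return current, current_score

def neighbors(point):
    """Represent possibilities: points differing by +/-1 in exactly one coordinate."""
    out = []
    for i in range(len(point)):
        for delta in (-1, 1):
            out.append(point[:i] + (point[i] + delta,) + point[i + 1:])
    return out

def score(point):
    """Toy objective (an assumption): higher is better, with a peak at (3, -2, 5)."""
    return -sum((p - t) ** 2 for p, t in zip(point, (3, -2, 5)))

solution, value = refine((0, 0, 0), score, neighbors)
```

The loop knows nothing about the problem beyond how to enumerate nearby candidates and how to score them. Swapping in a learned scorer or a learned proposal step changes how well it works, not what kind of procedure it is.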

That is a bigger vision than chatbots.

value and limits of the Hassabis approach

The value-add is obvious.

Search and reinforcement learning can discover strategies not present in human data. AlphaGo Zero found ideas that surprised experts. AlphaDev used reinforcement learning to discover faster sorting routines for short sequences. Self-play can create its own curriculum. Simulation can generate enormous training data. Planning can be more reliable than one-shot generation.

The limitations are also real.

These systems work best when there is a clear objective. Games have win conditions. Sorting algorithms have correctness and speed. Protein structures have physical and evolutionary constraints. Open-ended real-world tasks are messier.

They often need a simulator or model. If the simulator is wrong, search exploits the wrong world.

Search is computationally expensive. Looking ahead costs time and money.

Reward design can also go wrong. Reinforcement learning optimizes what you specify, not what you meant.

Still, the DeepMind lineage shows something important: intelligence can be built through interaction, planning, feedback, and optimization, not only imitation.

other real non-LLM approaches

JEPA and DeepMind-style search are not alone.

Diffusion models generate images, audio, video, molecules, and robot actions by learning to denoise. Diffusion Policy applies diffusion to robot control, modeling action trajectories instead of text.
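"Learning to denoise" can be sketched with the noising step used in DDPM-style diffusion: mix clean data with Gaussian noise, then ask a model to predict that noise. In this sketch an oracle that already knows the true noise stands in for the trained network, which is an assumption made to keep the example short. The data values and `alpha_bar=0.3` are also made up.

```python
import math
import random

def add_noise(x0, eps, alpha_bar):
    """Forward step: blend clean data x0 with noise eps.
    alpha_bar near 1 keeps mostly signal; near 0 keeps mostly noise."""
    return [math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * e
            for x, e in zip(x0, eps)]

def estimate_clean(xt, eps_pred, alpha_bar):
    """Invert the forward step, given a prediction of the noise that was added."""
    return [(x - math.sqrt(1 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
x0 = [1.0, -0.5, 0.25]                      # hypothetical clean sample
eps = [rng.gauss(0.0, 1.0) for _ in x0]     # the noise a network would learn to predict
xt = add_noise(x0, eps, alpha_bar=0.3)

# Oracle "model": hands back the true noise, so recovery is exact up to rounding.
recovered = estimate_clean(xt, eps, alpha_bar=0.3)
```

A trained model only approximates the noise, which is why generation proceeds over many small denoising steps and not in one jump.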

World-model agents like Dreamer learn latent dynamics and train policies inside imagined rollouts. This is close in spirit to JEPA but more action-oriented.

Evolutionary algorithms and quality-diversity methods search over populations of solutions. They are useful when gradients are unavailable or objectives are deceptive.
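A minimal sketch of the evolutionary idea, assuming a bit-string genome and a simple (1+λ) scheme: mutate the current parent several times, keep the best child if it is at least as good, and repeat. No gradients are involved. The genome length, population size, and mutation rate are arbitrary choices for illustration.

```python
import random

def evolve(fitness, genome_len=20, offspring=8, generations=200, seed=0):
    """(1+lambda) evolution over bit strings. Returns (best_genome, fitness_history)."""
    rng = random.Random(seed)
    parent = [rng.randint(0, 1) for _ in range(genome_len)]
    history = []
    for _ in range(generations):
        children = []
        for _ in range(offspring):
            # Flip each bit independently with probability 1 / genome_len.
            child = [bit ^ (rng.random() < 1 / genome_len) for bit in parent]
            children.append(child)
        best = max(children, key=fitness)
        if fitness(best) >= fitness(parent):  # elitism: never accept a worse genome
            parent = best
        history.append(fitness(parent))
    return parent, history

# Toy objective (an assumption): maximize the number of 1 bits.
genome, history = evolve(sum)
```

Because the parent is only replaced by a child that scores at least as well, the fitness history never decreases. The only thing the algorithm needs from the problem is a way to score a candidate.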

GFlowNets learn to sample diverse compositional objects with probability proportional to their reward, which is interesting for molecules and other domains where you want many good candidates, not only the single best one.
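The target behaviour can be shown on a tiny, fully enumerated candidate set. This is not a GFlowNet: a real one learns a step-by-step sampler precisely because the space is too large to enumerate. The candidate names and reward values are hypothetical.

```python
import random

def reward_proportional_probs(rewards):
    """Target distribution: p(x) = R(x) / sum of all rewards."""
    total = sum(rewards.values())
    return {x: r / total for x, r in rewards.items()}

def sample_candidates(rewards, k, seed=0):
    """Draw k candidates with probability proportional to reward."""
    rng = random.Random(seed)
    names = list(rewards)
    return rng.choices(names, weights=[rewards[n] for n in names], k=k)

rewards = {"mol_a": 8.0, "mol_b": 6.0, "mol_c": 2.0}  # hypothetical scores
probs = reward_proportional_probs(rewards)
best_only = max(rewards, key=rewards.get)   # what pure optimization returns
diverse = sample_candidates(rewards, k=10)  # what proportional sampling returns
```

Pure optimization returns the top candidate every time. Proportional sampling still favors it, but keeps returning the second-best candidate often enough that it gets tested too.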

Neurosymbolic systems combine neural perception with explicit rules, programs, or logic. They remain hard to scale, but they address a real weakness of neural-only systems: verifiable structure.

Energy-based models define compatibility or “energy” between variables rather than directly generating outputs. In principle, they can represent constraints and uncertainty in flexible ways.
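As a rough sketch of the idea, an energy function scores how compatible an input and a candidate output are, and inference means searching for the lowest-energy candidate. The particular energy below (prefer y near 2x, and penalize negative y) is invented for illustration. Real energy-based models learn the function from data.

```python
def energy(x, y):
    """Toy energy (an assumption): low when y is close to 2*x and y is non-negative."""
    constraint_penalty = 0.0 if y >= 0 else 10.0
    return (y - 2 * x) ** 2 + constraint_penalty

def infer(x, candidates):
    """Inference as minimization: return the most compatible candidate output."""
    return min(candidates, key=lambda y: energy(x, y))

candidates = range(-5, 6)
unconstrained = infer(1.5, candidates)   # the preference and the constraint agree
constrained = infer(-1, candidates)      # the constraint overrides the raw preference
```

The model never generates an output directly. It only says how good each candidate is, which is what makes it natural to add constraints as extra energy terms.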

The future probably does not belong to one of these approaches alone. It belongs to systems that compose them.

toward a post-LLM architecture

A plausible next-generation AI system may look less like a chatbot and more like a cognitive operating system.

It will still have an LLM-like language interface because language is the API of human collaboration. But behind that interface, it may also have perceptual world models trained from video and interaction, search when consequences matter, simulation before action, reinforcement learning from feedback, retrieval from memory, and symbolic tools for precision.

In that architecture, LLMs are not obsolete. They are the narrative and coordination layer. They translate goals, explain plans, write code, and communicate with humans.

But they are not the whole mind.

JEPA points toward grounded predictive understanding. DeepMind-style search points toward planning and optimization. Diffusion points toward iterative generation. World models point toward imagination. Graphs and symbolic tools point toward structure.

The question is not “what beats LLMs?”

The better question is: what do LLMs need around them?

A useful AI system needs more than fluent text. It needs memory, retrieval, planning, verification, feedback, and loops that improve over time.

The frontier is not just bigger models. It is better loops: perception into representation, representation into prediction, prediction into planning, planning into action, action into feedback, feedback into better models.

That loop, more than any single architecture, is what intelligence is made of.
