
What is JEPA? - The Next Big Thing on the Path to Super Intelligence

Written by Kevin Smith | Jul 23, 2025 11:00:00 PM


We’ve spent the last few years riding a generative wave — text-to-image, code completion, chat interfaces that write poetry. Large Language Models (LLMs) have become the poster children of artificial intelligence. Their ability to reason, summarise, and synthesise across vast domains is remarkable. But despite their linguistic brilliance, they often struggle with tasks that require grounding in physical reality — like solving a spatial logic puzzle, navigating an embodied environment, or interpreting a video sequence in context. These are tasks where seeing, sensing, and predicting the real world become essential.

These models are brilliant mimics, but they don’t really understand the world. They can’t see it. They can’t feel it. And they certainly can’t predict how it might change.

Enter JEPA — the Joint Embedding Predictive Architecture — a new class of AI model being pioneered by Meta AI and its Chief AI Scientist, Yann LeCun. It doesn’t work like ChatGPT. It doesn’t try to autocomplete your sentences or generate photorealistic art. Instead, it learns like we do: by watching, predicting, and abstracting meaning from the world around it.

JEPA may be the key to building AI that doesn’t just generate — it understands.

From Language Models to World Models

Let’s start with the obvious: LLMs are powerful. They’ve revolutionised everything from customer service to coding. But they’re also confined by their training data. They’ve read the internet, sure — but they’ve never seen a cat knock a glass off a table or a toddler learn to walk. They don’t know gravity. Or object permanence. Or that a kettle full of water is heavier than an empty one.

This is the difference between reading about the world and experiencing it.

LeCun has long argued that for AI to reach human-level intelligence, it must build something called a “world model” — an internal representation of how the world works. Think of it as common sense meets intuitive physics. It's how humans can predict what happens next in a video, or imagine what’s behind a door, or plan several steps ahead in a new situation.

To build world models at scale, Meta has been developing JEPA — a family of models that learn not by generating text or images, but by predicting abstract representations of missing information in sensory inputs like images or video. It’s a subtle but profound shift.

What is JEPA, and Why Is It Different?

Traditional generative models — like diffusion models for images or autoregressive models for text — try to predict the exact pixel or word that comes next. That’s great for fidelity, but it comes with baggage. Generating every fine detail is expensive, and often unnecessary. Do we really need to model every blade of grass to understand there’s a field?

JEPA takes a different approach. It masks part of an image or video, and instead of trying to reconstruct it pixel by pixel, it learns to predict the abstract representation of the missing content. That is, it guesses the meaning of what’s missing, not the exact look.

By operating in a latent space — a high-dimensional embedding that captures semantic content — JEPA avoids the trap of needing to model every fine-grained detail. It focuses on the structure and semantics, not the noise. This makes it more robust, more efficient, and arguably, more human-like.
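To make the contrast concrete, here is a toy sketch in PyTorch. The modules and shapes are invented stand-ins, not Meta's actual architecture; the point is simply where the loss is computed — on pixels versus on embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: in a real system these are large vision transformers.
pixel_decoder    = nn.Linear(64, 16 * 16 * 3)      # latent -> raw pixels of a 16x16 patch
latent_predictor = nn.Linear(64, 64)               # latent -> latent
target_encoder   = nn.Linear(16 * 16 * 3, 64)      # pixels -> latent

context_embedding = torch.randn(1, 64)             # what was inferred from the visible patches
masked_pixels     = torch.randn(1, 16 * 16 * 3)    # the patch that was hidden

# Generative objective: reproduce every pixel of the missing patch.
loss_generative = F.mse_loss(pixel_decoder(context_embedding), masked_pixels)

# JEPA objective: match the *representation* of the missing patch instead.
loss_jepa = F.mse_loss(latent_predictor(context_embedding),
                       target_encoder(masked_pixels))
```

Because the target is an embedding rather than raw pixels, the model is never penalised for failing to reproduce irrelevant texture or noise.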

I-JEPA: Learning to See

The first JEPA implementation, I-JEPA, launched in 2023 as a vision model trained entirely through self-supervision. It works like this:

  • Feed it an image.
  • Mask out a portion (say, the lower-right corner).
  • Ask the model to predict what that missing part looks like in abstract feature space.
  • Train it by comparing its prediction to a target encoder’s actual representation of the masked area.
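Put together, a single training step looks roughly like the sketch below. This is a simplified illustration with toy linear layers rather than the Vision Transformers Meta actually uses, but it captures the key moving parts: a context encoder, a predictor, and a target encoder that is updated as a slow exponential moving average of the context encoder and never receives gradients.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, NUM_PATCHES = 64, 196                  # a 14x14 grid of patches, toy sizes

# Toy stand-ins for the Vision Transformers used in the real model.
context_encoder = nn.Linear(EMBED_DIM, EMBED_DIM)
target_encoder  = nn.Linear(EMBED_DIM, EMBED_DIM)
predictor       = nn.Linear(EMBED_DIM, EMBED_DIM)
target_encoder.load_state_dict(context_encoder.state_dict())

optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

def training_step(patch_tokens, masked_idx, visible_idx, ema_decay=0.996):
    """One I-JEPA-style update on a batch of patch tokens of shape [B, N, D]."""
    # 1. Encode only the visible context patches.
    context = context_encoder(patch_tokens[:, visible_idx])

    # 2. Predict the representations of the masked patches from that context.
    #    (The real predictor also receives positional tokens for the masked slots.)
    pred = predictor(context.mean(dim=1, keepdim=True)).expand(-1, len(masked_idx), -1)

    # 3. Target representations come from the EMA encoder, with no gradient.
    with torch.no_grad():
        target = target_encoder(patch_tokens[:, masked_idx])

    # 4. L2 loss in embedding space -- no pixels are ever reconstructed.
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 5. Slowly move the target encoder towards the context encoder.
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(ema_decay).add_(p_c, alpha=1 - ema_decay)
    return loss.item()

# Example: a batch of 2 images with the lower block of patches masked out.
tokens = torch.randn(2, NUM_PATCHES, EMBED_DIM)
loss = training_step(tokens, masked_idx=list(range(150, 196)),
                     visible_idx=list(range(0, 150)))
```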


The result? A model that learns rich visual features — not because someone labelled the data, but because it’s trying to understand what should be there based on context.

It’s a bit like looking at a jigsaw puzzle with a few pieces missing and still knowing it’s a dog in a park. You don’t need every piece to get the picture. I-JEPA learns that same trick.

What’s more, I-JEPA performs surprisingly well. On benchmarks like ImageNet, it competes with state-of-the-art models — and it does so using a fraction of the compute and training time. According to Meta, I-JEPA used roughly one-tenth the compute of a comparable generative vision model, thanks to its abstract prediction approach (source).

It’s also far less reliant on hand-crafted data augmentations or contrastive sampling tricks. It learns directly from the data structure — a cleaner, more scalable method.

V-JEPA: Watching to Learn

If I-JEPA is about static images, then V-JEPA takes the same principle to video.

V-JEPA masks part of a video — say, a few frames in the middle — and tries to predict the missing temporal segment’s abstract representation. This forces the model to understand how things move, change, and interact over time.
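In tensor terms, the only real change from I-JEPA is that the mask now spans time as well as space. A minimal sketch, assuming a clip laid out as [batch, frames, channels, height, width]:

```python
import torch

# A toy clip: a batch of 2 videos, 16 frames each, 3 channels, 64x64 pixels.
video = torch.randn(2, 16, 3, 64, 64)

# Hide a temporal segment -- say frames 6 to 9 in the middle of the clip.
masked_frames  = torch.arange(6, 10)
visible_frames = torch.tensor([t for t in range(16) if t < 6 or t > 9])

context_clip = video[:, visible_frames]    # what the model is allowed to see
hidden_clip  = video[:, masked_frames]     # the segment it must account for

# As with images, the model never reconstructs hidden_clip's pixels.
# It predicts the target encoder's representation of hidden_clip from the
# context_clip alone, which forces it to model how the scene evolves over time.
```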

Meta introduced V-JEPA in 2024, and in 2025 followed it with V-JEPA 2, trained on over a million hours of internet-scale video. This monster model learned not just what things are, but how they behave — capturing motion, cause and effect, and visual dynamics.

One of the most impressive results came when V-JEPA 2 was post-trained on just 62 hours of robot data. With no further fine-tuning, it was able to control a robotic arm (Franka) to pick and place objects in real-world settings — simply by giving it a goal image. That’s zero-shot generalisation in robotics — a huge leap forward (source).

To be clear: the robot had never seen that table or object arrangement before. But because V-JEPA 2 had seen enough of the world via video, it could imagine what needed to happen and plan accordingly.
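Conceptually, the control loop is model-predictive planning in latent space: embed the goal image, imagine the outcome of candidate action sequences with the world model, and pick the sequence whose predicted end state lands closest to the goal. The sketch below illustrates that idea with random-shooting search and toy networks; it is a hedged approximation, not Meta's actual V-JEPA 2 controller.

```python
import torch
import torch.nn as nn

LATENT_DIM, ACTION_DIM, HORIZON, CANDIDATES = 64, 7, 8, 256

# Toy stand-ins: a frozen video encoder and an action-conditioned latent
# dynamics model ("what does the world look like after this action?").
encoder  = nn.Linear(3 * 64 * 64, LATENT_DIM)
dynamics = nn.Linear(LATENT_DIM + ACTION_DIM, LATENT_DIM)

def plan(current_image, goal_image):
    """Random-shooting MPC: return the first action of the best candidate sequence."""
    with torch.no_grad():
        z_now  = encoder(current_image.flatten())
        z_goal = encoder(goal_image.flatten())

        # Sample candidate action sequences and roll each forward in latent space.
        actions = torch.randn(CANDIDATES, HORIZON, ACTION_DIM)
        z = z_now.expand(CANDIDATES, LATENT_DIM)
        for t in range(HORIZON):
            z = dynamics(torch.cat([z, actions[:, t]], dim=-1))

        # Score each candidate by how close its predicted end state is to the goal.
        best = torch.norm(z - z_goal, dim=-1).argmin()
        return actions[best, 0]            # execute the first action, then re-plan

current = torch.randn(3, 64, 64)           # what the camera sees now
goal    = torch.randn(3, 64, 64)           # the goal image given to the robot
next_action = plan(current, goal)
```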

That’s not just impressive — it’s foundational. It suggests we’re getting closer to robots that learn like children: by watching.

JEPA and LLMs: Rivals or Teammates?

So where does this leave LLMs? Are we done with GPT-style models?

Not quite. JEPA isn’t replacing LLMs — it’s complementing them.

LLMs are brilliant at abstract symbolic reasoning. They understand and generate language, they can recall facts, and they can follow instructions. But they lack grounding. They’ve never seen a dog, or felt heat, or watched someone fall off a bike.

JEPA fills in that sensory gap. It gives AI a way to learn from perception, to build the kinds of models of the world that humans use all the time.

The most powerful systems will be hybrids. In fact, Meta already combined V-JEPA 2 with a language model to create a system that can answer questions about video content — essentially adding eyes and common sense to an LLM.

Think of it this way: the LLM is the talker, JEPA is the seer. One handles words, the other handles reality. Together, they make a smarter agent.
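In practice, that pairing usually looks like a pipeline: a frozen perception encoder turns the video into embeddings, a small projection maps those embeddings into the language model's token space, and the LLM reasons over both. The wiring below is a generic sketch of that pattern with placeholder modules, not Meta's implementation.

```python
import torch
import torch.nn as nn

VIDEO_DIM, LLM_DIM, VOCAB = 64, 128, 32000

# Placeholder modules standing in for a frozen V-JEPA-style encoder and an LLM.
video_encoder = nn.Linear(16 * 3 * 32 * 32, VIDEO_DIM)   # "the seer"
projector     = nn.Linear(VIDEO_DIM, LLM_DIM)            # bridges the two embedding spaces
llm_embedding = nn.Embedding(VOCAB, LLM_DIM)             # "the talker" (token embeddings only)

def build_llm_input(video, question_token_ids):
    """Prepend projected video features to the question tokens."""
    with torch.no_grad():                                 # the perception model stays frozen
        z_video = video_encoder(video.flatten())
    video_token = projector(z_video).unsqueeze(0)         # [1, LLM_DIM]
    text_tokens = llm_embedding(question_token_ids)       # [T, LLM_DIM]
    return torch.cat([video_token, text_tokens], dim=0)   # fed on to the LLM's decoder

clip     = torch.randn(16, 3, 32, 32)                     # 16 small video frames
question = torch.randint(0, VOCAB, (12,))                 # 12 tokenised question words
llm_input = build_llm_input(clip, question)
```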

The Race for World Models

Meta isn’t the only one in the game. NVIDIA recently announced its Cosmos models — a family of “world foundation models” trained on over 20 million hours of video, including driving footage and robotics demos. Cosmos aims to generate physically plausible video simulations to help train and test autonomous systems.

While Cosmos leans more towards generative simulation — literally producing future video frames — JEPA bets on latent abstraction. It’s not about drawing what’s next; it’s about knowing what’s likely.

Both approaches are valid. And both point to a future where models don’t just parrot back what they’ve seen — they learn how the world works.

Why This Matters for Business and Beyond

JEPA might sound academic, but it has massive real-world implications.

In robotics, it unlocks new levels of autonomy. Robots that learn by watching YouTube could adapt to your kitchen without custom training. In healthcare, models trained via JEPA could predict unseen structures in medical scans with less data. In autonomous driving, JEPA-style systems could better anticipate pedestrian movement or occluded vehicles.

But perhaps the most exciting area is embodied AI — agents that see, plan, and act in the physical world. From AR assistants to warehouse robots, we’ll need models that can understand rather than memorise.

JEPA provides the scaffolding for that understanding.


This article was originally written and published on LinkedIn by Kevin Smith, CTO and founder of Dootrix.