A world model that trains on one GPU and actually stays stable
LeWorldModel cuts JEPA training from six loss hyperparameters to one, then plans up to 48× faster than foundation-model competitors.

What it does
LeWorldModel (LeWM) is a ~15M-parameter world model that learns to predict future states from raw pixels using a Joint-Embedding Predictive Architecture (JEPA). It trains end-to-end on a single GPU in a few hours, then plans via model-predictive control (MPC) on 2D and 3D robotics tasks. Yann LeCun is among the authors.
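To make "plans via model-predictive control" concrete, here is a minimal sketch of MPC in a learned latent space. It is not LeWM's planner. The assumptions: predict is a hypothetical toy linear stand-in for the learned JEPA predictor, latents are short Python lists, the cost is squared distance to a goal latent, and the optimizer is plain random shooting. The authors' planner may use a different optimizer and cost.

```python
import random

# HYPOTHETICAL stand-in for the learned predictor: (latent, action) -> next latent.
# A toy additive rule is used purely for illustration.
def predict(z, a):
    return [zi + ai for zi, ai in zip(z, a)]

def rollout_cost(z0, actions, z_goal):
    """Roll the predictor forward in latent space; return squared distance to the goal."""
    z = z0
    for a in actions:
        z = predict(z, a)
    return sum((zi - gi) ** 2 for zi, gi in zip(z, z_goal))

def plan(z0, z_goal, horizon=5, n_samples=256, action_scale=1.0, seed=0):
    """Random-shooting planner: sample action sequences, keep the cheapest rollout."""
    rng = random.Random(seed)
    dim = len(z0)
    best_actions, best_cost = None, float("inf")
    for _ in range(n_samples):
        actions = [[rng.uniform(-action_scale, action_scale) for _ in range(dim)]
                   for _ in range(horizon)]
        cost = rollout_cost(z0, actions, z_goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost

def mpc_step(z0, z_goal, **kwargs):
    """Receding horizon: re-plan from the current latent, execute only the first action."""
    actions, _ = plan(z0, z_goal, **kwargs)
    return actions[0]
```

In a closed loop you would call mpc_step, apply the returned action in the real environment, encode the new observation into a latent, and re-plan. Executing only the first action of each plan is what lets MPC correct for predictor errors as they accumulate.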
The interesting bit
Most JEPAs are brittle: they need exponential moving averages, pretrained encoders, or baroque multi-term losses to keep representations from collapsing. LeWM stays stable with just two loss terms: next-embedding prediction plus a Gaussian regularizer on the latents. That cuts the tunable loss hyperparameters from six to one, compared with the only other end-to-end JEPA alternative.
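The shape of a two-term objective with a single weight can be sketched as follows. This is an illustration under assumptions, not the authors' loss: the prediction term is plain mean squared error, and the Gaussian regularizer is swapped for a simple hypothetical moment-matching penalty (per-dimension batch mean toward 0, variance toward 1). The function names and the lam=0.1 default are made up for the example; lam plays the role of the one remaining hyperparameter.

```python
def prediction_loss(pred, target):
    """Mean squared error between predicted and actual next embeddings (batch of vectors)."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pr, tr in zip(pred, target)
               for p, t in zip(pr, tr)) / n

def gaussian_moment_penalty(latents):
    """HYPOTHETICAL Gaussian regularizer: push each latent dimension toward
    zero mean and unit variance across the batch (moment matching only)."""
    batch, dim = len(latents), len(latents[0])
    penalty = 0.0
    for d in range(dim):
        col = [z[d] for z in latents]
        mean = sum(col) / batch
        var = sum((x - mean) ** 2 for x in col) / batch
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / dim

def two_term_loss(pred_next, target_next, latents, lam=0.1):
    """Prediction term plus lam times the regularizer; lam is the single loss knob."""
    return prediction_loss(pred_next, target_next) + lam * gaussian_moment_penalty(latents)
```

The regularizer is what makes collapse expensive. If the encoder maps every frame to the same vector, the prediction term drops to zero, but the batch variance is also zero, so the penalty is at least lam per dimension on average. A moment penalty like this only constrains means and variances, so it is a weaker stand-in for a full Gaussian-matching regularizer.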
Key highlights
- Trains from raw pixels with no auxiliary supervision or frozen encoders
- Plans up to 48× faster than foundation-model-based world models per the authors’ benchmarks
- Latent space probes reveal it encodes actual physical quantities (mass, position, etc.)
- “Surprise evaluation” detects physically implausible events reliably
- Pretrained checkpoints available for four environments (PushT, cube, two-room, reacher) on Hugging Face
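The source does not say how surprise is scored, so the following is an assumption: surprise as the squared distance between the predictor's latent and the latent of what was actually observed, with a threshold calibrated on physically plausible trajectories (mean plus k standard deviations). The function names and the k=3 default are hypothetical.

```python
import statistics

def surprise_scores(predicted, observed):
    """Per-step surprise: squared L2 distance between predicted and observed latents."""
    return [sum((p - o) ** 2 for p, o in zip(pz, oz))
            for pz, oz in zip(predicted, observed)]

def calibrate_threshold(nominal_scores, k=3.0):
    """Threshold = mean + k * population std of surprise on plausible trajectories."""
    return statistics.mean(nominal_scores) + k * statistics.pstdev(nominal_scores)

def flag_implausible(scores, threshold):
    """Indices of steps whose surprise exceeds the calibrated threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]
```

For example, with nominal surprise scores around 0.1, a step where an object jumps two units from where the model expected it scores 4.0 and is the only one flagged.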
Caveats
- The repo itself is mostly the architecture and loss; training, envs, and evaluation live in two sibling repositories (stable-worldmodel, stable-pretraining), so this is glue code with the core model inside
- Loading Hugging Face checkpoints requires a manual conversion script to produce the _object.ckpt format that evaluation expects
Verdict
Worth a look if you're researching sample-efficient world models or JEPA stability, especially on a limited GPU budget. Skip it if you need a batteries-included training framework: this repo assumes you'll clone the dependency stack.