facebookresearch/jepa
PyTorch implementation of V-JEPA, a self-supervised method for learning visual representations from video, developed by Meta AI Research.

V-JEPA (Video Joint Embedding Predictive Architecture) is a self-supervised learning approach that trains vision transformers to predict the latent feature representations of masked video regions from the visible parts of the clip. The method trains on video pixels alone, without pretrained image encoders, text, or human annotations. The codebase provides pretrained models and training scripts for learning visual representations that transfer to downstream video and image classification tasks using a frozen backbone with a lightweight probe.
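The masked latent prediction described above can be illustrated with a small PyTorch sketch. This is not the repository's code: the toy transformer sizes, the helper names (`make_transformer`, `latent_prediction_loss`, `ema_update`), the use of an exponential-moving-average target encoder, and the L1 regression loss are all assumptions made for illustration and are not stated in this description. Random tensors stand in for embedded video patches.

```python
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

B, N, D = 2, 16, 32  # batch size, tokens per clip, embedding width (toy values)


def make_transformer(depth):
    # Small stand-in for a vision transformer over patch tokens (hypothetical helper).
    layer = nn.TransformerEncoderLayer(
        d_model=D, nhead=4, dim_feedforward=4 * D, dropout=0.0, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=depth)


encoder = make_transformer(depth=2)    # sees only the visible tokens
predictor = make_transformer(depth=1)  # fills in features at the masked positions
# Assumption: the prediction targets come from a gradient-free copy of the encoder.
target_encoder = copy.deepcopy(encoder).requires_grad_(False)
mask_token = nn.Parameter(torch.zeros(1, 1, D))
pos_embed = nn.Parameter(torch.randn(1, N, D) * 0.02)


def latent_prediction_loss(tokens, mask):
    """tokens: (B, N, D) patch embeddings; mask: (N,) bool, True = hidden."""
    batch = tokens.shape[0]
    x = tokens + pos_embed
    with torch.no_grad():
        targets = target_encoder(x)[:, mask]  # features of the hidden regions
    context = encoder(x[:, ~mask])            # encode the visible tokens only
    # Rebuild the full sequence: context features where visible,
    # a learned mask token plus its position where hidden.
    full = (mask_token + pos_embed).expand(batch, N, D).clone()
    full[:, ~mask] = context
    preds = predictor(full)[:, mask]
    return F.l1_loss(preds, targets)          # regression in feature space (assumed L1)


@torch.no_grad()
def ema_update(target, online, momentum=0.99):
    # Assumed update rule: the target slowly tracks the online encoder's weights.
    for p_t, p in zip(target.parameters(), online.parameters()):
        p_t.mul_(momentum).add_(p, alpha=1 - momentum)


tokens = torch.randn(B, N, D)            # stand-in for embedded video patches
mask = torch.zeros(N, dtype=torch.bool)
mask[4:12] = True                        # hide a contiguous block of tokens

loss = latent_prediction_loss(tokens, mask)
loss.backward()                          # gradients reach encoder, predictor, mask token
ema_update(target_encoder, encoder)
```

The loss is computed between predicted and target features, so nothing in this sketch reconstructs pixels. For downstream evaluation, the same idea of a frozen backbone applies: the encoder's weights stay fixed and only a small probe on top of its output features is trained.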