Meta's video segmentation model now tracks objects through time
SAM 2 extends the original Segment Anything to video with streaming memory, turning one-off image masks into persistent object tracking.

What it does
SAM 2 is a foundation model for promptable visual segmentation in both images and videos. You click on or draw a box around an object in a frame, and the model generates a mask. For video, it then propagates that mask forward through time, tracking the object as it moves, becomes occluded, and reappears. The API is deliberately familiar to SAM users: SAM2ImagePredictor for static images, SAM2VideoPredictor for sequences, with an inference state object that remembers your prompts across frames.
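A minimal sketch of the video workflow described above, modeled on the usage pattern in the project README. It assumes the sam2 package is installed, a CUDA GPU is available, and video_dir is a folder of JPEG frames. The function name track_object, the choice of frame 0 for the click, object id 1, and the "facebook/sam2-hiera-large" model ID are illustrative assumptions, not requirements.

```python
def track_object(video_dir, click_xy, model_id="facebook/sam2-hiera-large"):
    """Track one clicked object through a video (hypothetical helper).

    Returns a dict mapping frame index to a boolean mask array.
    Heavy imports are done lazily so the module loads without sam2 installed.
    """
    import numpy as np
    import torch
    from sam2.sam2_video_predictor import SAM2VideoPredictor

    predictor = SAM2VideoPredictor.from_pretrained(model_id)

    # One positive click (label 1) at pixel coordinates (x, y).
    points = np.array([click_xy], dtype=np.float32)
    labels = np.array([1], dtype=np.int32)

    masks_by_frame = {}
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        # The inference state holds the frames and remembers prompts.
        state = predictor.init_state(video_path=video_dir)

        # Prompt on the first frame; object ids are arbitrary integers you choose.
        predictor.add_new_points_or_box(
            inference_state=state,
            frame_idx=0,
            obj_id=1,
            points=points,
            labels=labels,
        )

        # Propagate the prompt through the rest of the video.
        for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
            # Logits above zero count as foreground for the first (only) object.
            masks_by_frame[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()

    return masks_by_frame
```

Calling track_object("frames/", (210, 350)) would, under those assumptions, return one mask per frame for the clicked object; the path and coordinates here are placeholders.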
The interesting bit
The architecture treats an image as a single-frame video, so one transformer with streaming memory handles both modes. A December 2024 update added full torch.compile support for video inference and independent per-object tracking, so you can add new objects after tracking starts without re-prompting everything.
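To make the "image as a one-frame video" idea concrete, here is a purely conceptual toy, not SAM 2's actual code. It assumes a bounded first-in, first-out memory of past frames; the name stream_frames and the capacity of 3 are made up for illustration, and real memories would be feature tensors rather than frame names.

```python
from collections import deque


def stream_frames(frames, capacity=3):
    """Process frames in order, each seeing only a bounded window of past frames.

    Returns a list of (frame, memory_seen) pairs.
    """
    memory = deque(maxlen=capacity)  # oldest entries fall off automatically
    seen = []
    for frame in frames:
        # "Segment" the frame using whatever is currently in memory...
        seen.append((frame, list(memory)))
        # ...then store it so later frames can condition on it.
        memory.append(frame)
    return seen


video = stream_frames([f"frame{i}" for i in range(5)], capacity=3)
image = stream_frames(["photo"])  # an image is just a one-frame video

print(video[-1])  # ('frame4', ['frame1', 'frame2', 'frame3'])
print(image)      # [('photo', [])]
```

The same loop serves both cases: a lone image simply never has anything in memory to condition on.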
Key highlights
- Four checkpoint sizes from 39M to 224M parameters: the smallest runs at roughly 91 FPS and the largest at roughly 39 FPS on video benchmarks, so you trade accuracy for speed
- Training and fine-tuning code released as of September 2024, plus the full web demo frontend/backend
- Loads directly from Hugging Face Hub without manual checkpoint management
- Custom CUDA kernel compilation during install; the README notes you can ignore build failures and still run inference
- Example notebooks provided for image prediction, automatic mask generation, and video tracking with click/box prompts
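As a rough illustration of the Hugging Face loading path and the checkpoint choice, the sketch below maps a size name to a Hub model ID and runs a single-click image prediction. The model IDs are the ones I believe the README lists and should be checked against the Hub. The helper names load_image_predictor and segment_click are hypothetical, and a CUDA GPU plus an installed sam2 package are assumed.

```python
# Assumed Hub IDs for the four sizes; verify against the project README.
HF_MODEL_IDS = {
    "tiny": "facebook/sam2-hiera-tiny",
    "small": "facebook/sam2-hiera-small",
    "base_plus": "facebook/sam2-hiera-base-plus",
    "large": "facebook/sam2-hiera-large",
}


def load_image_predictor(size="large"):
    """Download (or reuse cached) weights from the Hub and build a predictor."""
    from sam2.sam2_image_predictor import SAM2ImagePredictor

    return SAM2ImagePredictor.from_pretrained(HF_MODEL_IDS[size])


def segment_click(predictor, image, click_xy):
    """Return the highest-scoring mask for one positive click.

    `image` is an RGB numpy array or PIL image; `click_xy` is (x, y) in pixels.
    """
    import numpy as np
    import torch

    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        predictor.set_image(image)
        masks, scores, _ = predictor.predict(
            point_coords=np.array([click_xy], dtype=np.float32),
            point_labels=np.array([1], dtype=np.int32),  # 1 = foreground click
            multimask_output=True,
        )
    # With multimask_output=True several candidates come back; keep the best one.
    return masks[scores.argmax()]
```

Starting with "tiny" for interactive work and moving up only if mask quality falls short is one reasonable way to use the size range.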
Caveats
- Requires PyTorch ≥2.5.1, Python ≥3.10, and an NVIDIA GPU with matching CUDA toolkit for full functionality
- Windows users are steered toward WSL; native Windows support appears limited
- The “failed to build CUDA extension” message during install is apparently common enough that the README explicitly tells you to ignore it
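The requirements above can be checked before installing. This is a small sketch, not part of the SAM 2 repository: the helper names version_tuple and check_environment are made up, and the thresholds are simply the versions quoted in the list.

```python
import importlib.util
import re
import sys


def version_tuple(version):
    """Turn a string like '2.5.1+cu124' into (2, 5, 1), dropping build suffixes."""
    core = version.split("+")[0]
    return tuple(int(n) for n in re.findall(r"\d+", core)[:3])


def check_environment():
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10 or newer is required")

    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        if version_tuple(torch.__version__) < (2, 5, 1):
            problems.append(f"PyTorch {torch.__version__} is older than 2.5.1")
        if not torch.cuda.is_available():
            problems.append("No CUDA-capable GPU is visible to PyTorch")
    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

It only reports problems and never raises, so it is safe to run on a machine that has none of the dependencies yet.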
Verdict
Computer vision researchers and engineers building video analysis pipelines should grab this. If you’re only doing occasional static-image segmentation and already have SAM 1 running, the upgrade is nice but not urgent.