facebookresearch/sam2

Meta's video segmentation model now tracks objects through time

SAM 2 extends the original Segment Anything to video with streaming memory, turning one-off image masks into persistent object tracking.

19.3k stars · Jupyter Notebook · Computer Vision · Inference · Serving
Velocity (7d): +28 ★/day · Trend: steady

What it does

SAM 2 is a foundation model for promptable visual segmentation in both images and videos. You click on or draw a box around an object in a frame, and the model generates a mask. For video, it then propagates that mask through the remaining frames, tracking the object as it moves, becomes occluded, and reappears. The API is deliberately familiar to SAM users: SAM2ImagePredictor for static images and SAM2VideoPredictor for sequences, with an inference state object that remembers your prompts across frames.
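The prompt-then-propagate flow can be sketched roughly as below. The method names (`init_state`, `add_new_points_or_box`, `propagate_in_video`) and the Hub id follow the repository README. The helper names `track_clicked_object` and `run_on_gpu`, the frame directory, and the click coordinates are illustrative assumptions. The predictor is passed in as an argument so the loop can be read (and exercised) without a GPU.

```python
def track_clicked_object(predictor, video_path, points, labels, obj_id=1, frame_idx=0):
    """Prompt one object with clicks on one frame, then propagate through the video.

    `predictor` is expected to behave like sam2's SAM2VideoPredictor.
    Returns {frame_index: mask_logits_for_obj_id}.
    """
    # The inference state holds the frames plus every prompt added so far.
    state = predictor.init_state(video_path=video_path)

    # labels: 1 marks a positive click, 0 a negative one.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=obj_id,
        points=points,
        labels=labels,
    )

    masks_by_frame = {}
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(state):
        for i, out_obj_id in enumerate(out_obj_ids):
            if out_obj_id == obj_id:
                # These are logits; threshold at 0 to get a binary mask.
                masks_by_frame[out_frame_idx] = out_mask_logits[i]
    return masks_by_frame


def run_on_gpu(video_dir="./my_video_frames"):  # hypothetical directory of JPEG frames
    """Wire the helper to a real predictor; needs sam2, torch and a CUDA GPU."""
    import numpy as np
    import torch
    from sam2.sam2_video_predictor import SAM2VideoPredictor

    predictor = SAM2VideoPredictor.from_pretrained("facebook/sam2-hiera-large")
    points = np.array([[210.0, 350.0]], dtype=np.float32)  # example click (x, y)
    labels = np.array([1], dtype=np.int32)
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
        return track_clicked_object(predictor, video_dir, points, labels)
```

Calling `run_on_gpu()` downloads the checkpoint on first use. A binary mask for a frame is then `masks_by_frame[frame] > 0.0`.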

The interesting bit

The architecture treats an image as a single-frame video, so one transformer with streaming memory handles both modes. A December 2024 update added full torch.compile support for video inference and independent per-object tracking, so you can add new objects after tracking starts without re-prompting everything.
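A hedged sketch of the two December 2024 features, shown separately. `vos_optimized=True` is the README's switch for whole-model torch.compile, and the config and checkpoint paths are the README's examples, which assume the checkpoints have been downloaded. The helper names are hypothetical, and running a second propagation pass to pick up the new object is an assumption about the workflow, not something the README prescribes.

```python
def build_compiled_predictor(
    model_cfg="configs/sam2.1/sam2.1_hiera_l.yaml",
    checkpoint="./checkpoints/sam2.1_hiera_large.pt",
):
    """Build a video predictor with whole-model torch.compile enabled.

    Needs sam2 and a CUDA GPU; the default paths are the README's examples.
    """
    from sam2.build_sam import build_sam2_video_predictor

    return build_sam2_video_predictor(model_cfg, checkpoint, vos_optimized=True)


def add_object_and_repropagate(predictor, state, frame_idx, new_obj_id, box):
    """Box-prompt a new object on an already-tracked video, then re-propagate.

    Existing objects keep their prompts in `state`; only the new one is prompted.
    Returns {obj_id: {frame_index: mask_logits}}.
    """
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=frame_idx,
        obj_id=new_obj_id,
        box=box,  # [x_min, y_min, x_max, y_max] in pixels
    )
    tracks = {}
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(state):
        for i, out_obj_id in enumerate(out_obj_ids):
            tracks.setdefault(out_obj_id, {})[out_frame_idx] = out_mask_logits[i]
    return tracks
```

The first compiled run is typically slow while kernels are generated, so the speedup only pays off on longer videos or repeated runs.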

Key highlights

  • Four checkpoint sizes from 39M to 224M parameters, trading accuracy for speed: the smallest runs at about 91 FPS on video benchmarks, the largest at about 39 FPS
  • Training and fine-tuning code released as of September 2024, plus the full web demo frontend/backend
  • Loads directly from Hugging Face Hub without manual checkpoint management
  • Custom CUDA kernel compilation during install; the README notes you can ignore build failures and still run inference
  • Example notebooks provided for image prediction, automatic mask generation, and video tracking with click/box prompts

Caveats

  • Requires PyTorch ≥2.5.1, Python ≥3.10, and an NVIDIA GPU with matching CUDA toolkit for full functionality
  • Windows users are steered toward WSL; native Windows support appears limited
  • The “failed to build CUDA extension” message during install is apparently common enough that the README explicitly tells you to ignore it
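A quick pre-install check for the version requirements listed above could look like this. It is a minimal sketch using only the standard library; the function names are made up for illustration, and it does not check for a GPU or the CUDA toolkit.

```python
import re
import sys
from importlib import metadata


def parse_version(text):
    """Turn '2.5.1+cu124' into (2, 5, 1); non-numeric suffixes are ignored."""
    public = text.split("+", 1)[0]  # drop a local build tag such as '+cu124'
    parts = []
    for piece in public.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed version string is at least `minimum` (a tuple)."""
    return parse_version(installed) >= minimum


def check_environment():
    """Return a list of human-readable problems; empty means the checks passed."""
    problems = []
    if sys.version_info[:2] < (3, 10):
        problems.append("Python >= 3.10 required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not meets_minimum(torch_version, (2, 5, 1)):
            problems.append(f"PyTorch >= 2.5.1 required, found {torch_version}")
    return problems
```

Once PyTorch is installed, `torch.cuda.is_available()` covers the GPU side of the requirement.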

Verdict

Computer vision researchers and engineers building video analysis pipelines should grab this. If you’re only doing occasional static-image segmentation and already have SAM 1 running, the upgrade is nice but not urgent.
