facebookresearch/sam3

Meta's SAM 3: segmentation by text, not just clicks

A foundation model that segments images and videos using open-vocabulary text prompts like "a player in white."

Velocity (7d): +32 ★ / day · Trend: steady

What it does

SAM 3 is Meta’s latest segmentation foundation model for images and videos. You can prompt it with text (to segment all instances of a concept), points, boxes, masks, or image exemplars, and it returns segmentation masks plus bounding boxes. It also tracks objects through video. The repo provides inference code, fine-tuning scripts, model checkpoints via Hugging Face, and a pile of Jupyter notebooks.
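
To make "masks plus bounding boxes" concrete, here is a minimal plain-Python sketch of how a tight box relates to a binary mask. It does not use the SAM 3 API, and the nested-list mask layout and the helper names `mask_to_box` and `mask_area` are illustrative assumptions; the real model's output format is not described in this summary.

```python
from typing import List, Optional, Tuple

# Assumption: a mask is a grid of 0/1 values (rows of pixels).
# Real model outputs are typically tensors; this is only a stand-in.
Mask = List[List[int]]


def mask_to_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return the tight (x_min, y_min, x_max, y_max) box, inclusive, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [x for row in mask for x, value in enumerate(row) if value]
    return min(cols), rows[0], max(cols), rows[-1]


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels in the mask."""
    return sum(1 for row in mask for value in row if value)


example = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(mask_to_box(example), mask_area(example))
```

The box is a lossy summary of the mask: the example covers 5 pixels inside a 3 × 2 box, which is why a segmentation model reports both.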

The interesting bit

The leap from SAM 2 is open-vocabulary concept segmentation at scale. SAM 3 was trained on data from a data engine that auto-annotated 4 million unique concepts, and it reaches 75–80% of human performance on Meta’s new SA-Co benchmark, which covers 270K concepts. A “presence token” in the architecture helps the model tell apart closely related text prompts, such as “player in white” versus “player in red.”

Key highlights

  • 848M parameters; detector + tracker share a vision encoder, with the detector built on DETR and the tracker inheriting SAM 2’s transformer design
  • Supports both image and video inference; video uses a session-based API with start_session and add_prompt requests
  • New SAM 3.1 checkpoints (March 2026) add a shared-memory multi-object tracking mode that is “significantly faster without sacrificing accuracy”
  • Requires Python 3.12+, PyTorch 2.7+, and CUDA 12.6+; you must also request Hugging Face access to download checkpoints
  • Optional flash-attention and custom CUDA kernels available for faster inference
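
The version floors and the gated-checkpoint requirement listed above can be checked before installing anything heavy. The following is a rough preflight sketch, not part of the repo: the function names are hypothetical, and it assumes you authenticate through the `HF_TOKEN` environment variable that `huggingface_hub` reads (a cached login would work too, so that check is only a hint).

```python
import os
import re
import sys
from typing import List, Mapping, Optional, Tuple

# Minimum versions as stated in the highlights above.
MIN_PYTHON = (3, 12)
MIN_TORCH = (2, 7)
MIN_CUDA = (12, 6)


def parse_version(text: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.7.1+cu126' or '12.6'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def meets(version: Optional[str], minimum: Tuple[int, int]) -> bool:
    """True if a version string is present and at least the minimum."""
    return version is not None and parse_version(version) >= minimum


def preflight(
    torch_version: Optional[str],
    cuda_version: Optional[str],
    env: Mapping[str, str] = os.environ,
    python: Tuple[int, int] = tuple(sys.version_info[:2]),
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if tuple(python) < MIN_PYTHON:
        problems.append(f"Python {python[0]}.{python[1]} is older than 3.12")
    if not meets(torch_version, MIN_TORCH):
        problems.append(f"PyTorch {torch_version} is missing or older than 2.7")
    if not meets(cuda_version, MIN_CUDA):
        problems.append(f"CUDA {cuda_version} is missing or older than 12.6")
    if not env.get("HF_TOKEN"):
        # Assumption: token-based auth; ignore this if you logged in another way.
        problems.append("HF_TOKEN is not set; gated checkpoints may fail to download")
    return problems


if __name__ == "__main__":
    try:
        import torch  # optional: only used to read the installed versions
        found_torch, found_cuda = torch.__version__, torch.version.cuda
    except ImportError:
        found_torch = found_cuda = None
    for problem in preflight(found_torch, found_cuda):
        print("-", problem)
```

Passing the versions in as plain strings keeps the check usable on machines where PyTorch is not installed yet.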

Caveats

  • Checkpoints are gated behind Hugging Face approval; you cannot just wget them
  • The README’s benchmark table is truncated, so full comparison numbers against prior work are incomplete in the source
  • Video API is request-dictionary based rather than a simple function call, which adds boilerplate
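
The boilerplate from a request-dictionary API can be hidden behind a thin wrapper. The sketch below is hypothetical: only the request names start_session and add_prompt come from this summary, while the `VideoSession` class, the dictionary keys (`type`, `resource_path`, `session_id`, `frame_index`, `text`), and the reply shape are assumptions, and `fake_handler` stands in for the real predictor so the example runs without the model.

```python
from typing import Any, Callable, Dict

Request = Dict[str, Any]
Handler = Callable[[Request], Dict[str, Any]]


class VideoSession:
    """Wrap a request-dictionary handler in two ordinary method calls."""

    def __init__(self, handle_request: Handler, video_path: str) -> None:
        self._handle = handle_request
        # Assumed request shape: the field names here are illustrative.
        reply = self._handle({"type": "start_session", "resource_path": video_path})
        self.session_id = reply["session_id"]

    def add_text_prompt(self, frame_index: int, text: str) -> Dict[str, Any]:
        """Send an add_prompt request carrying a text prompt for one frame."""
        return self._handle(
            {
                "type": "add_prompt",
                "session_id": self.session_id,
                "frame_index": frame_index,
                "text": text,
            }
        )


def fake_handler(request: Request) -> Dict[str, Any]:
    """Stand-in for the real predictor: echoes requests instead of segmenting."""
    if request["type"] == "start_session":
        return {"session_id": "demo-session"}
    if request["type"] == "add_prompt":
        return {"echo": request}
    raise ValueError(f"unknown request type: {request['type']!r}")


session = VideoSession(fake_handler, "clip.mp4")
reply = session.add_text_prompt(frame_index=0, text="a player in white")
print(reply["echo"]["session_id"])
```

To try this against the real model, you would pass the predictor's own request-handling callable in place of `fake_handler` and adjust the keys to match the repo's README.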

Verdict

Worth a look if you need segmentation driven by natural language rather than manual clicking, especially for video. Skip it if you want a lightweight, zero-dependency model or immediate checkpoint access without paperwork.
