
bytedance/Sa2VA

ByteDance's open-source multimodal LLM codebase combining SAM2 segmentation with LLaVA-style vision-language models for dense visual understanding of images and videos.

Sa2VA
Velocity (7d): +3.1 ★/day · Trend: steady

Sa2VA merges Segment Anything Model 2 (SAM2) with Large Language and Vision Assistant (LLaVA)-style architectures to enable dense, grounded understanding of images and videos. The codebase includes training pipelines, pretrained model weights, and datasets such as Ref-SAV for multimodal instruction tuning. Its reported results include 1st place in the RVOS track of the ICCV 2025 LSVOS Challenge.
