bytedance/Sa2VA
ByteDance's open-source multimodal LLM codebase combining SAM2 segmentation with LLaVA-style vision-language models for dense visual understanding of images and videos.

Sa2VA combines Segment Anything Model 2 (SAM2) with a LLaVA-style (Large Language and Vision Assistant) vision-language model to enable dense grounded understanding of images and videos. The codebase includes training pipelines, pretrained model weights, and datasets such as Ref-SAV for multimodal instruction tuning. Its reported results include 1st place in the ICCV 2025 LSVOS Challenge RVOS Track.