Is Sa2VA open source?

Yes — bytedance/Sa2VA is open source, released under the Apache-2.0 license.

What language is Sa2VA written in?

bytedance/Sa2VA is primarily written in Python.

How popular is Sa2VA?

bytedance/Sa2VA has 1.6k stars on GitHub.

Where can I find Sa2VA?

bytedance/Sa2VA is on GitHub at https://github.com/bytedance/Sa2VA.

← all repositories

bytedance/Sa2VA

SAM2 and LLaVA share a token space at last

Sa2VA unifies video segmentation and vision-language reasoning by marrying SAM2 with LLaVA in a single model.

★1.6k stars Python Language Models Image · Video · Audio

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does

Sa2VA is a unified image-and-video understanding model that combines SAM-2’s segmentation chops with LLaVA’s vision-language reasoning. You can ask it to segment a “girl in a yellow dress” across video frames, or hold a conversation about scene atmosphere, and it grounds the answers in actual pixel masks. It comes in flavors from 1B to 26B parameters, built on InternVL2.5, InternVL3, and Qwen2.5/3 backbones.

The interesting bit

Instead of treating segmentation and language as separate pipelines, Sa2VA shoves text, image, and video tokens into the same LLM space. That shared vocabulary is what lets it handle referring segmentation and open-ended chat with minimal one-shot instruction tuning. The repo also ships training data, evaluation sets, and a converter to HuggingFace format—rare completeness for a research release.

Key highlights

First unified model for dense grounded understanding of both images and videos, per the authors.
Handles referring segmentation and open-ended conversation after minimal one-shot instruction tuning.
Model zoo spans 1B to 26B parameters with multiple backbone options (InternVL2.5, InternVL3, Qwen2.5-VL, Qwen3-VL).
Ships training data, evaluation sets, inference scripts, and a HuggingFace export tool.
Companion projects VRT and SaSaSa2VA (1st place in ICCV 2025 LSVOS RVOS track) live in the same repo.

Caveats

Training the larger variants requires serious hardware; the README suggests at least 8 A100 GPUs.
Dependency management splits across “legacy” and “latest” Transformer extras depending on which backbone you pick, so version compatibility is a manual exercise.
The SA-V video dataset must be downloaded separately from Meta and is not bundled with the provided training data.

Verdict

Worth a look if you research dense video grounding or need a single model that both segments and chats about visual content. Skip it if you just want a drop-in segmentation tool without the LLM baggage.

Frequently asked

What is bytedance/Sa2VA?: Sa2VA unifies video segmentation and vision-language reasoning by marrying SAM2 with LLaVA in a single model.
Is Sa2VA open source?: Yes — bytedance/Sa2VA is open source, released under the Apache-2.0 license.
What language is Sa2VA written in?: bytedance/Sa2VA is primarily written in Python.
How popular is Sa2VA?: bytedance/Sa2VA has 1.6k stars on GitHub.
Where can I find Sa2VA?: bytedance/Sa2VA is on GitHub at https://github.com/bytedance/Sa2VA.