ByteDance-Seed/Seed1.5-VL

ByteDance's vision-language foundation model, combining a 532M-parameter vision encoder with a 20B-active-parameter MoE LLM for multimodal understanding and reasoning.

★ 1.6k stars · Jupyter Notebook · Language Models · Image · Video · Audio

Velocity (7d): +4.0 ★/day · Trend: steady

Seed1.5-VL is a general-purpose vision-language foundation model for multimodal understanding and reasoning. It pairs a 532M-parameter vision encoder with a mixture-of-experts (MoE) language model that has 20B active parameters. The model is reported to reach state-of-the-art performance on a range of benchmarks covering OCR, diagram understanding, visual grounding, 3D spatial reasoning, and video comprehension, as well as agent-centric tasks such as GUI control and gameplay. The repository provides usage cookbooks and best practices for developers.