ByteDance-Seed/Seed1.5-VL
ByteDance's vision-language foundation model combining a 532M-parameter vision encoder with a mixture-of-experts (MoE) LLM with 20B active parameters for multimodal understanding and reasoning.

Seed1.5-VL is a general-purpose vision-language foundation model designed for multimodal understanding and reasoning. It combines a 532M-parameter vision encoder with a mixture-of-experts (MoE) language model that has 20B active parameters. It achieves state-of-the-art performance on benchmarks spanning OCR, diagram understanding, visual grounding, 3D spatial reasoning, and video comprehension, as well as agent-centric tasks such as GUI control and gameplay. The repository provides usage cookbooks and best practices for developers.