MoonshotAI/Kimi-VL
Open-source Mixture-of-Experts vision-language model with 128K context window and autonomous agent capabilities.

Kimi-VL is an efficient open-source vision-language model (VLM) that uses a Mixture-of-Experts (MoE) architecture in its language decoder, activating only 2.8B parameters per token. It supports multimodal reasoning and long-context understanding up to 128K tokens, and it shows strong agent capabilities on multi-turn interaction tasks such as OSWorld. The model includes a native-resolution vision encoder (MoonViT) and achieves performance competitive with GPT-4o-mini, Qwen2.5-VL-7B, and other efficient VLMs.
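
To make "activating only" a subset of parameters concrete, here is a minimal sketch of generic top-k Mixture-of-Experts routing in plain Python. It is an illustration of the general technique, not Kimi-VL's actual router: the function names (`top_k_route`, `moe_forward`, `make_expert`), the toy experts, and the choice of k are all hypothetical.

```python
import math

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank experts by router logit and keep the k best.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the weights sum to 1.
    m = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the k selected experts; unselected experts never run."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

def make_expert(scale):
    # Toy stand-in for an expert feed-forward block (hypothetical).
    return lambda x: [scale * xi for xi in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]   # one router score per expert, for one token
routes = top_k_route(logits, k=2)
output = moe_forward([1.0, 2.0], experts, logits, k=2)
```

With k=2 of 4 experts, only half of the expert parameters take part in computing this token's output. The same principle, at a much larger scale, is what lets an MoE decoder keep its per-token compute well below its total parameter count.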