zai-org/CogVLM
CogVLM is an open-source, 17B-parameter visual language model that supports image understanding and multi-turn dialogue.

Velocity (7d): +6.8 ★/day
Trend: steady
CogVLM is a multimodal pretrained visual language model that achieves state-of-the-art results on 10 cross-modal benchmarks, including captioning, VQA, and referring tasks. CogAgent is an 18B-parameter model built on CogVLM that adds GUI agent capabilities for autonomous screen operation. Both models use a visual expert architecture to align visual and language representations, and both support high-resolution image understanding.
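As a rough illustration of the visual expert idea, the sketch below routes each token through one of two weight matrices depending on whether it is an image token or a text token. It is a toy example in plain Python, not the repository's implementation. The names `route_by_modality`, `text_weight`, and `image_weight` are hypothetical, and the assumption that image tokens get their own projection while text tokens keep the language model's original one is a simplification of how such a module is commonly described.

```python
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def matvec(weight: Matrix, x: Sequence[float]) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def route_by_modality(
    hidden: Sequence[Vector],
    is_image: Sequence[bool],
    text_weight: Matrix,
    image_weight: Matrix,
) -> List[Vector]:
    """Project each token with the weights matching its modality.

    Assumption (illustrative only): image tokens use a separate
    "expert" projection, text tokens use the original projection.
    """
    if len(hidden) != len(is_image):
        raise ValueError("hidden and is_image must have the same length")
    return [
        matvec(image_weight if img else text_weight, h)
        for h, img in zip(hidden, is_image)
    ]


# Toy sequence: two image tokens followed by one text token.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
image_mask = [True, True, False]
text_w = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for the language model's weights
image_w = [[2.0, 0.0], [0.0, 2.0]]  # stand-in for the visual expert's weights

projected = route_by_modality(hidden_states, image_mask, text_w, image_w)
```

In this toy setup the text token passes through unchanged while the image tokens are transformed by their own matrix. That separation is the point of the sketch: each modality can have its own parameters inside a shared sequence.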