zai-org/CogVLM2
Open-source multi-modal LLM, built on Llama3-8B, that combines vision and language understanding.

Velocity (7d): +3.2 ★/day · Trend: steady
CogVLM2 is an open-source multi-modal model, described as GPT-4V-level, that integrates visual and language capabilities. It supports image understanding and extends to video comprehension through keyframe extraction, handling videos of up to 1 minute. Deployment options include TGI inference and INT4-quantized versions that require only 16 GB of VRAM.
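
To make the keyframe idea concrete, here is a minimal sketch of one common strategy: picking evenly spaced frame indices from a clip. This is not CogVLM2's actual extraction code. The function name `sample_keyframe_indices`, the default of 24 frames, and the uniform-spacing rule are all assumptions made for illustration; only the 1-minute limit comes from the description above.

```python
def sample_keyframe_indices(duration_s, fps, num_frames=24, max_duration_s=60.0):
    """Return evenly spaced frame indices for a clip.

    Hypothetical helper: the name, the num_frames default, and the
    uniform-spacing rule are illustrative assumptions, not CogVLM2's API.
    """
    if duration_s <= 0 or fps <= 0 or num_frames <= 0:
        raise ValueError("duration_s, fps and num_frames must be positive")
    if duration_s > max_duration_s:
        # The description above mentions videos of up to 1 minute.
        raise ValueError("clip is longer than the supported maximum")

    total_frames = int(duration_s * fps)
    if total_frames == 0:
        return []

    # Never ask for more keyframes than the clip actually contains.
    n = min(num_frames, total_frames)
    step = total_frames / n

    # Take the frame at the centre of each of the n equal-length segments.
    return [int(i * step + step / 2) for i in range(n)]


if __name__ == "__main__":
    # A 60-second clip at 30 fps has 1800 frames; sample 24 of them.
    print(sample_keyframe_indices(60.0, 30))
```

The selected indices could then be decoded with any video library and passed to the model as images. Sampling segment centres rather than segment starts avoids always picking the very first frame, which is often a fade-in or title card.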