Is CogVLM open source?

Yes — zai-org/CogVLM is open source, released under the Apache-2.0 license.

What language is CogVLM written in?

zai-org/CogVLM is primarily written in Python.

How popular is CogVLM?

zai-org/CogVLM has 6.7k stars on GitHub.

Where can I find CogVLM?

zai-org/CogVLM is on GitHub at https://github.com/zai-org/CogVLM.

← all repositories

zai-org/CogVLM

An open vision-language model that can chat, caption, and click

CogVLM and CogAgent bind high-resolution image understanding to language reasoning, with CogAgent adding visual GUI automation to the mix.

★6.7k stars Python Language Models Agents

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does

CogVLM-17B is an open visual-language model that ingests 490×490 images and carries on multi-turn conversations about them, from captioning to fine-grained visual grounding. CogAgent-18B raises the resolution to 1120×1120 and adds a GUI-agent mode: it reads screenshots and predicts interface actions. Both variants can run through SAT or Hugging Face pipelines, and the repo includes an OpenAI-style vision API wrapper.

The interesting bit

Rather than bolting a tiny vision encoder onto a large language model, CogVLM dedicates 10 billion visual parameters (CogAgent uses 11 billion) to a visual expert layer alongside 7 billion language parameters. CogAgent then retargets that visual capacity toward operating software interfaces, scoring high on GUI benchmarks like AITW and Mind2Web.

Key highlights

The authors report state-of-the-art results on ten cross-modal benchmarks for CogVLM-17B (NoCaps, RefCOCO, GQA, etc.).
CogAgent-18B is claimed to hit state-of-the-art generalist performance on nine benchmarks and to surpass existing models on GUI datasets.
4-bit quantized inference fits into roughly 11 GB of GPU memory.
Bilingual Chinese/English capability is supported.
Fine-tuning recipes via LoRA and an OpenAI-compatible vision API are provided.

Caveats

The README is mostly quick-start scripts; deeper architecture notes live in a separate Chinese-language wiki.
CogVLM2, the newer Llama-3-based successor, is developed in a separate repository.
The README immediately offers 4-bit quantization and model-parallel sharding across up to eight GPUs, suggesting full-precision inference is not the default path.

Verdict

Great for researchers and hackers prototyping multimodal chat, document understanding, or desktop automation. Skip it if you need a tiny, edge-deployable vision model—the 17B-parameter floor makes that impossible.

Frequently asked

What is zai-org/CogVLM?: CogVLM and CogAgent bind high-resolution image understanding to language reasoning, with CogAgent adding visual GUI automation to the mix.
Is CogVLM open source?: Yes — zai-org/CogVLM is open source, released under the Apache-2.0 license.
What language is CogVLM written in?: zai-org/CogVLM is primarily written in Python.
How popular is CogVLM?: zai-org/CogVLM has 6.7k stars on GitHub.
Where can I find CogVLM?: zai-org/CogVLM is on GitHub at https://github.com/zai-org/CogVLM.