A 6B-parameter Chinese chatbot that fits on a budget GPU
ChatGLM-6B squeezes bilingual dialogue onto consumer hardware through aggressive quantization, then gives the weights away for free.

What it does
ChatGLM-6B is a 6.2-billion-parameter bilingual (Chinese/English) dialogue model built on the GLM architecture. With INT4 quantization it runs locally in as little as 6 GB of VRAM, and fine-tuning fits in as little as 7 GB. The weights are fully open for academic research, and commercial use is free once you complete a registration questionnaire.
The interesting bit
The project treats hardware constraints as a first-class design problem rather than an afterthought. The README leads with a quantization matrix showing exactly how much GPU memory each mode (FP16, INT8, INT4) needs, plus a separate column for fine-tuning. That kind of transparency is rarer than it should be in LLM land.
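The shape of that matrix follows from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below is a back-of-envelope estimate, not the project's own method. It assumes weights only and ignores activations, the KV cache, and framework overhead, which is presumably why the 6 GB INT4 figure sits well above the ~3.1 GB of raw INT4 weights.

```python
# Back-of-envelope weight memory for ChatGLM-6B at different precisions.
# Assumption: counts weights only; activations, KV cache, any layers left
# unquantized, and framework overhead are ignored, so real usage is higher.

PARAMS = 6.2e9  # parameter count quoted for ChatGLM-6B

BYTES_PER_PARAM = {
    "FP16": 2.0,  # 16-bit floats
    "INT8": 1.0,  # 8-bit integers
    "INT4": 0.5,  # two 4-bit values packed per byte
}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


estimates = {mode: weight_gb(PARAMS, b) for mode, b in BYTES_PER_PARAM.items()}

for mode, gb in estimates.items():
    print(f"{mode}: ~{gb:.1f} GB of weights")
```

Halving the bits halves the weight footprint, which is the whole trick behind fitting a 6B-class model on a budget card.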
Key highlights
- INT4 quantization enables inference on 6 GB VRAM and fine-tuning on 7 GB
- Includes P-Tuning v2 for parameter-efficient fine-tuning
- Trained on ~1T tokens with supervised fine-tuning, feedback bootstrapping, and RLHF
- Weights free for research; commercial use allowed after survey registration
- Ecosystem of third-party ports: C++ inference (MNN, InferLLM), CPU-only inference (JittorLLMs), LangChain integration, VS Code plugins
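Why P-Tuning v2 is cheap is easiest to see by counting what it trains. P-Tuning v2 freezes the base model and learns a short sequence of continuous "virtual token" prompts inserted at every layer. The sketch below counts those parameters under stated assumptions: 28 layers and a hidden size of 4096 for ChatGLM-6B, a hypothetical prefix length of 128, one key and one value vector per virtual token per layer, and no reparameterization MLP. None of these numbers come from the review itself.

```python
# Rough count of trainable parameters in a P-Tuning v2 prefix.
# Assumptions: NUM_LAYERS and HIDDEN_SIZE are assumed ChatGLM-6B values;
# PRE_SEQ_LEN is a hypothetical prefix length; no reparameterization MLP.

NUM_LAYERS = 28       # assumed transformer depth
HIDDEN_SIZE = 4096    # assumed hidden width
PRE_SEQ_LEN = 128     # hypothetical number of virtual prompt tokens
TOTAL_PARAMS = 6.2e9  # parameter count quoted for ChatGLM-6B


def prefix_params(pre_seq_len: int, num_layers: int, hidden_size: int) -> int:
    """Each virtual token contributes one key and one value vector per layer."""
    return pre_seq_len * num_layers * 2 * hidden_size


trainable = prefix_params(PRE_SEQ_LEN, NUM_LAYERS, HIDDEN_SIZE)
fraction = trainable / TOTAL_PARAMS

print(f"trainable prefix params: {trainable:,}")
print(f"fraction of the model: {fraction:.2%}")
```

Under these assumptions only about half a percent of the model is trained. Since the frozen base weights dominate memory, that plausibly explains why the fine-tuning figure (7 GB) sits only slightly above inference (6 GB).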
Caveats
- The model is explicitly described as small, probabilistic, and “easily misled”; the authors disclaim responsibility for misuse
- The team notes they have built no official apps (web, iOS, Android, Windows)—everything is community-driven
- The README is now largely a billboard for newer GLM-4 models, with ChatGLM-6B itself in maintenance mode
Verdict
Worth a look if you need a permissively licensed Chinese-English chatbot that actually fits on modest hardware. Skip it if you want state-of-the-art performance or an actively developed upstream; the project has moved on to GLM-4.