← all repositories
zai-org/ChatGLM2-6B

A 6B-parameter LLM that runs on a GTX 1060

ChatGLM2-6B squeezes bilingual chat into consumer GPUs through aggressive quantization and attention tricks.

15.6k stars Python Language ModelsChat Assistants
ChatGLM2-6B
Velocity · 7d
+14
★ / day
Trend
steady
star history

What it does ChatGLM2-6B is a 6-billion-parameter Chinese-English dialogue model from Tsinghua’s THUDM lab. It packs INT4 quantization tight enough to run on 6GB VRAM, supports 8K context in chat (32K in a separate variant), and ships with P-Tuning v2 and full fine-tuning scripts. The repo is essentially a HuggingFace wrapper plus inference examples and evaluation benchmarks.

The interesting bit The real engineering is in the memory math, not the parameter count. Multi-Query Attention slashes KV cache size, and the model reuses previous turns’ caches via Causal Mask training—so a conversation that would asphyxiate the first-gen ChatGLM-6B after 1,119 characters keeps breathing past 8,192 on the same 6GB card. That’s a 7× context win without buying new hardware.

Key highlights

  • INT4 inference needs only 5.1–5.5 GB VRAM; INT8 sits at ~8 GB
  • Benchmark jumps are lopsided: +571% on GSM8K math versus ChatGLM-6B, though MMLU gains a more modest +23%
  • FlashAttention-backed context stretching to 32K in the dedicated -32K variant
  • Free commercial use after a registration questionnaire; fully open for research
  • PyTorch 2.0+ strongly recommended—older versions fall back to naive attention and balloon memory

Caveats

  • The README is upfront that 6B parameters means hallucination and misleading outputs are guaranteed risks, not edge cases
  • The team explicitly disclaims any downstream apps; you’re on your own for safety review and deployment
  • Quantization costs a few points on MMLU and C-Eval, though the README calls it “acceptable”

Verdict Worth a look if you need bilingual Chinese-English chat on constrained hardware or want a permissively licensed base to fine-tune. Skip it if you need state-of-the-art reasoning out of the box—those GSM8K gains start from a very low baseline.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.