Is ChatLM-mini-Chinese open source?

Yes — charent/ChatLM-mini-Chinese is open source, released under the Apache-2.0 license.

What language is ChatLM-mini-Chinese written in?

charent/ChatLM-mini-Chinese is primarily written in Python.

How popular is ChatLM-mini-Chinese?

charent/ChatLM-mini-Chinese has 1.7k stars on GitHub.

Where can I find ChatLM-mini-Chinese?

charent/ChatLM-mini-Chinese is on GitHub at https://github.com/charent/ChatLM-mini-Chinese.

charent/ChatLM-mini-Chinese

How to train a 0.2B Chinese chatbot in your spare room

It exposes every step of building a 0.2B-parameter Chinese chatbot—from raw text cleaning to DPO—so you can train one on a GPU with just 4GB of VRAM.

★ 1.7k stars | Python | Language Models | ML Frameworks

What it does

ChatLM-mini-Chinese is a 0.2B-parameter Chinese dialogue model built on a slimmed-down T5 backbone: ten encoder layers and ten decoder layers instead of T5-base’s twelve of each, plus a 29,298-token vocabulary weighted toward Chinese. The repository treats the model as a side effect of its real product, which is a complete, reproducible training pipeline. It ships the data sources, cleaning scripts, tokenizer training logic, text-to-text pre-training code, SFT (supervised fine-tuning) trainer, and DPO (direct preference optimization) code: enough to go from raw crawls to a chatbot that runs inference in 512MB of VRAM.
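As a rough sanity check on the 512MB figure, the weight footprint can be estimated from the parameter count alone. The sketch below assumes exactly 0.2 billion parameters (the text gives only the rounded figure) and counts weights only, ignoring activations and framework overhead. The function name is made up for illustration.

```python
def weight_memory_mib(num_params: float, bytes_per_param: int) -> float:
    """Return the memory needed to hold the weights alone, in MiB."""
    return num_params * bytes_per_param / (1024 ** 2)


# Assumption: "0.2B" taken literally; the exact parameter count is not stated here.
NUM_PARAMS = 0.2e9

fp16_mib = weight_memory_mib(NUM_PARAMS, 2)  # float16 uses 2 bytes per parameter
fp32_mib = weight_memory_mib(NUM_PARAMS, 4)  # float32 uses 4 bytes per parameter

print(f"float16 weights: ~{fp16_mib:.0f} MiB")  # ~381 MiB
print(f"float32 weights: ~{fp32_mib:.0f} MiB")  # ~763 MiB
```

At two bytes per parameter the weights come to roughly 381 MiB, which is consistent with float16 inference fitting in about 512MB once some working memory is added. In float32 the weights alone would already exceed that budget.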

The interesting bit

The author built a custom trainer that streams gigabyte-scale datasets with buffer shuffling, deliberately avoiding memory and disk caches so the whole pre-training circus fits on a machine with 16GB of RAM and a 4GB GPU. The corpus itself is deduplicated with what the project calls "mini-hash", which appears to mean MinHash-style document deduplication. There is also a working downstream example that fine-tunes the model for triplet information extraction without turning it into a single-task amnesiac, which is a neat trick for something this small.
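Buffer shuffling is a common way to randomize a dataset that is too large to load at once. The following is a minimal sketch of the general technique, not the repository's trainer code: `buffered_shuffle` and `read_records` are hypothetical names, and the integer records stand in for lines read lazily from a large file.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(stream: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield items from `stream` in approximately random order.

    Only `buffer_size` items are held in memory at any time, so the
    full dataset never has to be loaded or cached.
    """
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Move a random element to the end, then pop it in O(1).
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: shuffle and drain whatever is left.
    rng.shuffle(buffer)
    yield from buffer


def read_records(n: int) -> Iterator[int]:
    """Hypothetical lazy data source; a real one would read a file line by line."""
    for i in range(n):
        yield i


shuffled = list(buffered_shuffle(read_records(100), buffer_size=10))
print(shuffled[:10])
```

A larger buffer gives a more thorough shuffle at the cost of more memory; a buffer of size 1 degenerates to the original order.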

Key highlights

Full pipeline transparency: data provenance (webtext2019zh, BELLE, Zhihu-KOL, etc.), cleaning, tokenizer training, pre-training, SFT, and DPO are all included and documented.
Consumer-grade feasibility: pre-trains at batch_size=1 on a 4GB GPU; inference in float16 needs only 512MB of VRAM.
Resilient training: the custom trainer supports single-machine multi-GPU setups, stopping and resuming training at any point, and a dynamic per-batch maximum length to squeeze into limited memory.
Preference optimization: supports both full-model DPO and LoRA-based DPO, with adapter merging back into the base model.
Downstream versatility: includes a triplet information extraction fine-tuning example that preserves the base dialogue capability.
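The "dynamic per-batch maximum length" highlight refers to padding each batch only to the length of its own longest sequence rather than to a fixed global length. Below is a minimal sketch of that idea under stated assumptions: `PAD_ID`, `collate_dynamic`, and the default cap of 16 are hypothetical and not taken from the repository.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id


def collate_dynamic(batch: List[List[int]], hard_max_len: int = 16) -> List[List[int]]:
    """Pad a batch of token-id lists to the longest sequence in this batch.

    Sequences longer than `hard_max_len` are truncated first, so the
    padded width never exceeds that cap.
    """
    truncated = [ids[:hard_max_len] for ids in batch]
    batch_max = max(len(ids) for ids in truncated)
    return [ids + [PAD_ID] * (batch_max - len(ids)) for ids in truncated]


short_batch = collate_dynamic([[5, 6, 7], [8]])
print(short_batch)  # [[5, 6, 7], [8, 0, 0]]

long_batch = collate_dynamic([[1] * 20, [2]])
print([len(row) for row in long_batch])  # [16, 16]
```

A batch of short sentences is padded to width 3 here instead of 16, which is where the memory saving comes from: activation memory grows with the padded width, not with the true sentence lengths.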

Caveats

The README candidly warns that 0.2B parameters and a 9.3-million-sample corpus are insufficient for broad coverage, so the model can wander off-topic or ramble.
Streaming chat is hard-coded to greedy search; any other decoding strategy, such as beam search with sampling, requires disabling the streamer.
The tokenizer and data pipeline are narrowly tuned for Chinese text with minimal English support, and the author explicitly dropped datasets containing complex tables or translation tasks.
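The source does not say why streaming is limited to greedy search. One plausible reason, offered here as an assumption, is that greedy decoding commits to each token as soon as it is chosen, while beam search can change its preferred prefix after later tokens are scored. The toy sketch below illustrates that difference; `TOY_MODEL` and both functions are hypothetical and unrelated to the repository's code.

```python
import math
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy "model": next-token probabilities keyed by the previous token.
# Tokens with no entry end the sequence.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.55, "y": 0.45},
    "B": {"z": 0.9, "w": 0.1},
}


def greedy_stream(start: str = "<s>") -> Iterator[str]:
    """Yield each token the moment it is picked; it is never revised."""
    token = start
    while token in TOY_MODEL:
        probs = TOY_MODEL[token]
        token = max(probs, key=lambda t: probs[t])
        yield token


def beam_search(width: int = 2, start: str = "<s>") -> List[str]:
    """Return the best full sequence; the winner is known only at the end."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [start])]
    while any(seq[-1] in TOY_MODEL for _, seq in beams):
        candidates: List[Tuple[float, List[str]]] = []
        for score, seq in beams:
            next_probs = TOY_MODEL.get(seq[-1])
            if next_probs is None:
                candidates.append((score, seq))  # finished hypothesis, keep as is
                continue
            for tok, p in next_probs.items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:width]
    return beams[0][1][1:]  # drop the start token


print(list(greedy_stream()))  # ['A', 'x']
print(beam_search(width=2))   # ['B', 'z']
```

Greedy search emits "A" first because it is locally most likely, but the two-beam search ends up preferring the sequence that starts with "B" (probability 0.36 versus 0.33). A streamer that had already printed "A" would have shown text the final answer does not contain.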

Verdict

This is a teaching specimen for developers who want to see how a language model is actually born, not just how to prompt one. If you need a reliable, general-purpose Chinese conversationalist for production, look elsewhere—the author admits this is homework, not a finished oracle.
