A Chinese tokenizer that asks for your email first
Tsinghua's venerable NLP lab offers a competent Chinese segmenter with a bureaucratic model-download step.

What it does
THULAC is a Chinese word segmentation and part-of-speech tagging toolkit from Tsinghua’s NLP lab. You feed it UTF-8 text; it returns words with tags like n (noun) or np (person name). It runs as a Python module or a command-line tool, with optional fast paths through a compiled C++ shared object.
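To make the "words with tags" output concrete, here is a minimal sketch of splitting a tagged line into (word, tag) pairs. It assumes the word_tag, space-separated text format with an underscore delimiter; the sample string is made up for illustration rather than captured from the tool, and parse_tagged is a hypothetical helper, not part of THULAC.

```python
from typing import List, Tuple


def parse_tagged(line: str, delimiter: str = "_") -> List[Tuple[str, str]]:
    """Split 'word_tag word_tag ...' text into (word, tag) pairs.

    Hypothetical helper: assumes tokens are separated by whitespace and
    that each tag follows the last occurrence of the delimiter.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST delimiter, so a word that itself
        # contains the delimiter keeps its full text.
        word, sep, tag = token.rpartition(delimiter)
        if not sep:
            # No delimiter found: treat the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


# Illustrative input only; real tagger output may differ.
sample = "张三_np 读_v 书_n"
pairs = parse_tagged(sample)

# Keep only tokens tagged as person names (np).
person_names = [word for word, tag in pairs if tag == "np"]
print(pairs)
print(person_names)
```

In the Python module, a string like this would most likely come from something along the lines of thulac.thulac().cut(text, text=True); check that call against the README before relying on it.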
The interesting bit
The project ships with a “simple” model and a joint segmentation-POS model trained on People’s Daily, but keeps its best-performing multi-corpus “Model_3” behind an application form. The README’s benchmark tables show it trading blows with jieba and ICTCLAS on standard test sets: slower than jieba on raw throughput, but generally ahead on precision.
Key highlights
- CTB5 F1 scores: 97.3% segmentation, 92.9% POS tagging (per README claims)
- Segmentation-only speed: 1.3 MB/s; joint tagging: ~300 KB/s
- Supports user dictionaries, traditional-to-simplified conversion, and a filter for “meaningless” words
- fast_cut interface available via separate C++ .so compilation
- Python 2.x and 3.x compatible
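For a sense of what those speed figures mean in practice, here is a back-of-the-envelope sketch. It assumes decimal units (1 MB = 1,000,000 bytes), a hypothetical 1 GB corpus, and that the README's throughput numbers hold on your hardware, which they may not.

```python
def seconds_to_process(size_bytes: float, bytes_per_second: float) -> float:
    """Estimate wall-clock seconds at a constant throughput."""
    return size_bytes / bytes_per_second


KB = 1_000          # assumption: decimal kilobytes
MB = 1_000_000      # assumption: decimal megabytes

corpus_bytes = 1_000 * MB  # hypothetical 1 GB corpus

# Throughput figures as quoted from the README.
seg_only_s = seconds_to_process(corpus_bytes, 1.3 * MB)
joint_s = seconds_to_process(corpus_bytes, 300 * KB)

print(f"segmentation only: {seg_only_s / 60:.1f} min")
print(f"joint seg + POS:   {joint_s / 60:.1f} min")
print(f"slowdown factor:   {joint_s / seg_only_s:.1f}x")
```

Under those assumptions, segmentation alone takes roughly 13 minutes and joint tagging roughly 56, so turning on POS tags costs a bit over 4x.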
Caveats
- Only UTF-8 is supported; other encodings are “coming soon” since at least 2016
- Full model download requires submitting personal info to a Tsinghua website
- Last meaningful update appears to be 2017; the fast .so repo may be stale
Verdict
Worth a look if you need a research-credentialed Chinese segmenter with decent accuracy and don’t mind the model-acquisition paperwork. Skip if you want zero-config installation or actively maintained dependencies.