thunlp/THULAC-Python

A Chinese tokenizer that asks for your email first

Tsinghua's venerable NLP lab offers a competent Chinese segmenter with a bureaucratic model-download step.

2.1k stars · Python · Data Tooling · +0.6 ★/day over 7 days (trend: steady)

What it does

THULAC is a Chinese word segmentation and part-of-speech tagging toolkit from Tsinghua’s NLP lab. You feed it UTF-8 text; it returns words with tags like n (noun) or np (person name). It runs as a Python module or a command-line tool, with an optional faster path through a separately compiled C++ shared object (.so).
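To make the "words with tags" output concrete, here is a minimal sketch. It assumes the `thulac` package is installed from PyPI and that `cut(..., text=True)` returns a space-separated string of `word_tag` tokens, which is the form the README describes. The helpers `parse_tagged` and `segment` are hypothetical names introduced for illustration and are not part of THULAC; the sample string is illustrative and was not produced by running a model here.

```python
def parse_tagged(line, deli="_"):
    """Split THULAC-style 'word_tag word_tag' text into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST delimiter, so a word that itself
        # contains the delimiter keeps its full text.
        word, sep, tag = token.rpartition(deli)
        if not sep:
            # No delimiter found: treat the token as an untagged word
            # (the shape you would expect from segmentation-only output).
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def segment(text):
    """Run THULAC on `text` and return (word, tag) pairs.

    The import is deferred so parse_tagged stays usable without THULAC.
    """
    import thulac  # third-party: pip install thulac

    thu = thulac.thulac()  # default settings: segmentation plus POS tags
    return parse_tagged(thu.cut(text, text=True))


# Illustrative input in the word_tag format (r = pronoun, v = verb, ns = place name).
sample = "我_r 爱_v 北京_ns 天安门_ns"
print(parse_tagged(sample))
```

Only `parse_tagged` runs without the models; `segment` needs a working THULAC install and will load the bundled model files on first use.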

The interesting bit

The project ships with a “simple” model and a joint segmentation-POS model trained on People’s Daily, but keeps its best-performing multi-corpus “Model_3” behind an application form. The README’s benchmark tables show it trading blows with jieba and ICTCLAS on standard test sets: slower than jieba on raw throughput, but generally higher in precision.

Key highlights

  • CTB5 F1 scores: 97.3% segmentation, 92.9% POS tagging (as reported in the README)
  • Segmentation-only speed: 1.3 MB/s; joint tagging: ~300 KB/s
  • Supports user dictionaries, traditional-to-simplified conversion, and a filter for “meaningless” words
  • fast_cut interface available via separate C++ .so compilation
  • Python 2.x and 3.x compatible
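The options in the list above map onto constructor arguments. The sketch below assumes the constructor signature documented in the README, `thulac.thulac(user_dict=None, model_path=None, T2S=False, seg_only=False, filt=False, deli='_')`; verify it against the version you install. The wrapper functions `thulac_kwargs` and `make_segmenter`, and the file name `my_words.txt`, are hypothetical.

```python
def thulac_kwargs(user_dict=None, t2s=False, seg_only=False, filt=False):
    """Build keyword arguments for thulac.thulac() from friendlier names.

    user_dict: path to a UTF-8 word list, one word per line (assumed format)
    t2s:       convert traditional characters to simplified before segmenting
    seg_only:  skip POS tagging (the faster, segmentation-only mode)
    filt:      drop words the toolkit considers meaningless
    """
    kwargs = {"T2S": t2s, "seg_only": seg_only, "filt": filt}
    if user_dict is not None:
        kwargs["user_dict"] = user_dict
    return kwargs


def make_segmenter(**options):
    """Construct a THULAC instance; the import is deferred until needed."""
    import thulac  # third-party: pip install thulac

    return thulac.thulac(**thulac_kwargs(**options))


# Inspect the arguments without loading any models.
print(thulac_kwargs(user_dict="my_words.txt", seg_only=True))
```

Calling `make_segmenter(seg_only=True)` would then give the segmentation-only configuration that the quoted 1.3 MB/s figure refers to, provided the package and its models are available.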

Caveats

  • Only UTF-8 is supported; other encodings have been “coming soon” since at least 2016
  • Full model download requires submitting personal info to a Tsinghua website
  • Last meaningful update appears to be 2017; the fast .so repo may be stale
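The UTF-8 restriction is easy to work around by transcoding before segmentation. A minimal sketch using only the Python standard library follows; the helper name `to_utf8` is hypothetical, and the default of GB18030 is an assumption about what legacy simplified-Chinese input is likely to use, so pass the real source encoding when you know it.

```python
def to_utf8(raw, source_encoding="gb18030"):
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if `raw` is not valid in `source_encoding`,
    which is usually preferable to silently segmenting mojibake.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Simulate legacy input by encoding a known string as GB18030.
legacy_bytes = "我爱北京天安门".encode("gb18030")
utf8_bytes = to_utf8(legacy_bytes)
print(utf8_bytes.decode("utf-8"))
```

The same function handles traditional-Chinese sources with `to_utf8(data, "big5")`; the result can be written to a file and passed to THULAC as ordinary UTF-8 input.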

Verdict

Worth a look if you need a research-credentialed Chinese segmenter with decent accuracy and don’t mind the model-acquisition paperwork. Skip if you want zero-config installation or actively maintained dependencies.
