Is THULAC-Python open source?

Yes — thunlp/THULAC-Python is open source, released under the MIT license.

What language is THULAC-Python written in?

thunlp/THULAC-Python is primarily written in Python.

How popular is THULAC-Python?

thunlp/THULAC-Python has 2.1k stars on GitHub.

Where can I find THULAC-Python?

thunlp/THULAC-Python is on GitHub at https://github.com/thunlp/THULAC-Python.

thunlp/THULAC-Python

A Chinese tokenizer that asks for your email first

Tsinghua's venerable NLP lab offers a competent Chinese segmenter with a bureaucratic model-download step.

★ 2.1k stars · Python · Data Tooling

What it does

THULAC is a Chinese word segmentation and part-of-speech tagging toolkit from Tsinghua’s NLP lab. You feed it UTF-8 text; it returns words with tags like n (noun) or np (person name). It runs as a Python module or a command-line tool, with an optional faster path backed by a compiled C++ shared object (.so).
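As a rough sketch of that flow, the snippet below assumes the `thulac` package is installed and uses the constructor and `cut(..., text=True)` call described in the project README. `parse_tagged` and `segment` are hypothetical helper names, not part of the library; the sample string only imitates the README's `word_tag` output format, and real tags depend on the model.

```python
def parse_tagged(line, deli="_"):
    """Split THULAC-style text output ("word_tag word_tag ...") into (word, tag) pairs.

    Hypothetical helper, not part of thulac. Assumes tokens are separated by
    whitespace and that the tag follows the last delimiter in each token.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition(deli)
        if not sep:  # no delimiter found: segmentation-only output
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def segment(text):
    """Segment and tag text; requires the thulac package and a model."""
    import thulac  # imported lazily so parse_tagged works without the package

    thu = thulac.thulac()  # default mode: segmentation plus POS tagging
    return parse_tagged(thu.cut(text, text=True))


if __name__ == "__main__":
    # Sample string in the README's output style; actual tags depend on the model.
    print(parse_tagged("我_r 爱_v 北京_ns 天安门_ns"))
```

Keeping the parser separate from the library call makes it easy to post-process output that was produced earlier by the command-line tool.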

The interesting bit

The project ships with a “simple” model and a joint segmentation-POS model trained on People’s Daily, but keeps its best-performing multi-corpus “Model_3” behind an application form. The README’s benchmark tables show it trading blows with jieba and ICTCLAS on standard test sets: slower than jieba on raw throughput, but generally higher in precision.
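For readers unfamiliar with how segmentation scores of this kind are usually computed, here is a minimal sketch of span-based precision, recall, and F1 (Python 3). It illustrates the standard word-boundary metric and is not code from the THULAC repository; the function names and the toy sentence are assumptions made for illustration.

```python
def spans(words):
    """Map a word sequence to the set of (start, end) character offsets it covers."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def seg_prf(gold, pred):
    """Return (precision, recall, F1) of a predicted segmentation against gold."""
    g, p = spans(gold), spans(pred)
    correct = len(g & p)  # words whose start and end both match
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["我", "爱", "北京", "天安门"]
    pred = ["我", "爱", "北京", "天安", "门"]  # one word wrongly split in two
    print(seg_prf(gold, pred))
```

A word counts as correct only when both of its boundaries match, so a single wrong split costs precision and recall at the same time.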

Key highlights

CTB5 F1 scores: 97.3% segmentation, 92.9% POS tagging (per README claims)
Segmentation-only speed: 1.3 MB/s; joint tagging: ~300 KB/s
Supports user dictionaries, traditional-to-simplified conversion, and a filter for “meaningless” words
fast_cut interface available via separate C++ .so compilation
Python 2.x and 3.x compatible
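
The options listed above correspond to constructor arguments. The sketch below uses the keyword names given in the README (`user_dict`, `T2S`, `seg_only`, `filt`); `build_options` and `make_segmenter` are hypothetical helpers, and the snippet has not been checked against every released version of the package.

```python
def build_options(user_dict=None, to_simplified=False, seg_only=False,
                  drop_meaningless=False):
    """Translate readable flag names into thulac.thulac() keyword arguments.

    Hypothetical helper; the keyword names in the dict follow the README.
    """
    return {
        "user_dict": user_dict,    # path to a UTF-8 word list, one word per line
        "T2S": to_simplified,      # convert traditional to simplified first
        "seg_only": seg_only,      # skip POS tagging (the faster mode)
        "filt": drop_meaningless,  # apply the "meaningless word" filter
    }


def make_segmenter(**flags):
    """Create a segmenter; requires the thulac package to be installed."""
    import thulac  # lazy import keeps build_options usable without the package

    return thulac.thulac(**build_options(**flags))
```

With the package installed, `make_segmenter(seg_only=True).cut(text, text=True)` would select the segmentation-only mode that the speed figures above refer to.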

Caveats

Only UTF-8 is supported; other encodings are “coming soon” since at least 2016
Full model download requires submitting personal info to a Tsinghua website
Last meaningful update appears to be 2017; the fast .so repo may be stale
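
Because only UTF-8 input is accepted, text in a legacy encoding has to be converted before it reaches the segmenter. A minimal sketch using only the standard library: GB18030 is an assumed source encoding (common for Simplified Chinese text), and `to_utf8` is a hypothetical helper name.

```python
def to_utf8(raw, source_encoding="gb18030"):
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    which is usually preferable to silently segmenting garbage.
    """
    return raw.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    legacy = "北京".encode("gb18030")
    print(to_utf8(legacy).decode("utf-8"))
```

The converted bytes can be written to a file for the command-line tool; when calling the Python 3 module directly, decoding to `str` is the part that matters.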

Verdict

Worth a look if you need a research-credentialed Chinese segmenter with decent accuracy and don’t mind the model-acquisition paperwork. Skip if you want zero-config installation or actively maintained dependencies.
