A neural Chinese NLP pipeline that won't mangle your text
CKIP's tagger segments, tags, and recognizes entities in Chinese while preserving every character you feed it.

What it does
CkipTagger runs word segmentation, part-of-speech tagging, and named entity recognition on Chinese text. It’s a Python library from Taiwan’s Academia Sinica: you install it via pip, download about 2GB of model files, and call it through WS, POS, and NER objects. The pipeline handles Traditional Chinese, and the project says it supports indefinitely long sentences and never auto-deletes, changes, or adds characters.
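A minimal sketch of how the three stages chain together. The class names WS, POS, and NER come from the library; the helper names `run_pipeline` and `load_ckiptagger`, the `./data` model directory, and the exact shape of the NER output are assumptions for illustration, so check them against the version you install.

```python
from typing import Any, Callable, List, Sequence, Tuple


def run_pipeline(
    sentences: Sequence[str],
    ws: Callable[..., List[List[str]]],
    pos: Callable[..., List[List[str]]],
    ner: Callable[..., List[Any]],
) -> List[Tuple[List[str], List[str], Any]]:
    """Chain segmentation -> POS tagging -> NER over a batch of sentences.

    The three callables are passed in so this helper can be exercised
    with stand-ins when the real models are not available.
    """
    words = ws(sentences)        # one list of words per sentence
    tags = pos(words)            # one list of POS tags per sentence
    entities = ner(words, tags)  # one entity collection per sentence
    return list(zip(words, tags, entities))


def load_ckiptagger(model_dir: str = "./data"):
    """Hypothetical loader: builds the three objects from a model folder.

    ckiptagger is imported lazily so that run_pipeline stays importable
    on machines without the library or the ~2GB of model files.
    """
    from ckiptagger import WS, POS, NER

    return WS(model_dir), POS(model_dir), NER(model_dir)


# Usage, assuming the model files are already unpacked in ./data:
#   ws, pos, ner = load_ckiptagger("./data")
#   for words, tags, entities in run_pipeline(["傅達仁今將執行安樂死"], ws, pos, ner):
#       print(list(zip(words, tags)), entities)
```

Loading the models is the slow part, so build the three objects once and reuse them across batches.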
The interesting bit
The project is explicitly conservative: it promises not to “auto delete/change/add characters,” which is a subtler brag than it sounds. Chinese segmentation tools often normalize or strip punctuation silently; this one treats your input as immutable. It also lets you nudge the segmenter with weighted word lists (a “recommend” dictionary and a stricter “coerce” dictionary) rather than forcing you to accept its neural network’s best guess.
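The no-mutation promise is easy to check yourself: if nothing was added, dropped, or rewritten, joining the output words must reproduce the input exactly, and character offsets computed from the words point back into the original string. The sketch below shows that check alongside a hedged example of passing the two dictionaries. The helpers `preserves_input`, `word_offsets`, and `segment_with_hints` are hypothetical names; `construct_dictionary` and the `recommend_dictionary` / `coerce_dictionary` keywords are as I recall them from the project README and should be verified against your installed version.

```python
from typing import Dict, List, Sequence, Tuple


def preserves_input(original: str, words: List[str]) -> bool:
    """True when the segmented words concatenate back to the exact input."""
    return "".join(words) == original


def word_offsets(words: List[str]) -> List[Tuple[int, int]]:
    """Half-open (start, end) character offsets for each word.

    These are only meaningful in the original string when
    preserves_input() holds for the same words.
    """
    offsets = []
    start = 0
    for word in words:
        end = start + len(word)
        offsets.append((start, end))
        start = end
    return offsets


def segment_with_hints(
    ws,
    sentences: Sequence[str],
    recommend: Dict[str, float],
    coerce: Dict[str, float],
) -> List[List[str]]:
    """Assumed usage: run a WS object with weighted word lists.

    `recommend` entries are soft suggestions; `coerce` entries are the
    stricter list. ckiptagger is imported lazily so the two helpers
    above work without it.
    """
    from ckiptagger import construct_dictionary

    return ws(
        sentences,
        recommend_dictionary=construct_dictionary(recommend),
        coerce_dictionary=construct_dictionary(coerce),
    )


# Usage, assuming `ws` is an already-loaded WS object:
#   text = "土地公有政策？"
#   words = segment_with_hints(ws, [text], {"土地公": 1}, {"公有": 2})[0]
#   assert preserves_input(text, words)
```

Running `preserves_input` as an assertion in a batch job is a cheap way to catch a tokenizer swap that quietly breaks offset-based annotations.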
Key highlights
- F1 of 97.33% on ASBC 4.0 word segmentation, beating the classic CKIPWS (95.91%) and Jieba-zh_TW (89.80%)
- POS accuracy of 94.59% on the same corpus
- GPU support via TensorFlow/CUDA; CPU fallback works out of the box
- User-defined dictionaries with per-word weights for segmentation hints
- Published model architecture: BiLSTM with attention, from an AAAI 2020 paper
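For readers unsure what a segmentation F1 measures, here is a small worked example of the conventional span-based metric: a predicted word counts as correct only if both of its boundaries match a gold word. This is a sketch of the usual definition, assumed (not confirmed) to be what the reported ASBC 4.0 numbers use; the function names and the sample segmentations are made up for illustration.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f1(gold: List[str], predicted: List[str]) -> float:
    """Span-level F1 between a gold and a predicted segmentation."""
    gold_spans = to_spans(gold)
    pred_spans = to_spans(predicted)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Illustrative data: the prediction wrongly splits the three-character name.
gold = ["傅達仁", "今", "將", "執行", "安樂死"]
predicted = ["傅達", "仁", "今", "將", "執行", "安樂死"]

# 4 of 6 predicted words and 4 of 5 gold words match, so F1 = 8/11.
score = segmentation_f1(gold, predicted)
print(f"{score:.4f}")  # prints 0.7273
```

One bad split costs two predicted words and one gold word here, which is why a few points of F1 separate tools that feel very different in practice. Corpus-level scores are normally computed from counts pooled over all sentences rather than by averaging per-sentence values.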
Caveats
- Requires a ~2GB model download from Google Drive or a mirror at IIS (Academia Sinica’s Institute of Information Science); no clear versioning of model files
- Backend is tf-keras on TensorFlow, so you’re inheriting that dependency stack
- GPL-3.0 license, which may complicate commercial use if you distribute derivatives
Verdict
Worth a look if you need accurate Traditional Chinese NLP and care about text fidelity: the no-mutation guarantee matters for downstream tasks. Skip if you wanted something lightweight or permissively licensed; this is a research-grade tool with research-grade baggage.