ckiplab/ckiptagger

A neural Chinese NLP pipeline that won't mangle your text

CKIP's tagger segments, tags, and recognizes entities in Chinese while preserving every character you feed it.

1.7k stars · Python · ML Frameworks · Other AI
Velocity (7d): +0.7 ★/day · Trend: steady

What it does

CkipTagger runs word segmentation, part-of-speech tagging, and named entity recognition on Chinese text. It’s a Python library from Taiwan’s Academia Sinica: you install it via pip, download about 2GB of model files, and call the WS, POS, and NER objects in sequence. The pipeline handles Traditional Chinese and claims to support indefinitely long sentences; it also promises not to delete or change characters automatically.
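A minimal sketch of that workflow, assuming the `WS`, `POS`, and `NER` classes and the `data_utils.download_data_gdown` helper as shown in the project's README. The wrapper names `load_models` and `run_pipeline` are hypothetical, and the import is deferred so the TensorFlow stack loads only when the models are actually needed.

```python
def load_models(data_dir="./data"):
    """Load the three taggers from an already-downloaded model directory."""
    # Deferred import: ckiptagger pulls in TensorFlow, which is slow to load.
    from ckiptagger import WS, POS, NER
    return WS(data_dir), POS(data_dir), NER(data_dir)


def run_pipeline(sentences, ws, pos, ner):
    """Segment, tag, and extract entities for a list of sentences."""
    words = ws(sentences)        # one list of words per input sentence
    tags = pos(words)            # POS tags aligned with `words`
    entities = ner(words, tags)  # one entity collection per sentence
    return words, tags, entities


# Typical use, once the package is installed and the models are fetched
# (assumed one-time download step, per the README):
#   from ckiptagger import data_utils
#   data_utils.download_data_gdown("./")
#   ws, pos, ner = load_models("./data")
#   words, tags, entities = run_pipeline(["傅達仁今將執行安樂死"], ws, pos, ner)
```

Passing the three taggers in as arguments keeps the 2GB model load in one place and makes the pipeline step easy to exercise with stand-ins.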

The interesting bit

The project is explicitly conservative: it promises not to “auto delete/change/add characters,” which is a subtler brag than it sounds. Chinese segmentation tools often normalize or strip punctuation silently; this one treats your input as immutable. It also lets you nudge the segmenter with weighted word lists (a “recommend” dictionary and a stricter “coerce” dictionary) rather than forcing you to accept its neural network’s best guess.
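The fidelity promise is easy to verify in your own code: if nothing is added, dropped, or changed, joining the segmented words must reproduce the input exactly. The sketch below is illustrative rather than part of the library. `segment_checked` is a hypothetical helper, and the keyword names `recommend_dictionary` and `coerce_dictionary` are assumed from the README's example call, where the dictionaries are built from a word-to-weight mapping with `construct_dictionary`.

```python
def segment_checked(ws, sentences, recommend=None, coerce=None):
    """Segment sentences and confirm the output still spells the input."""
    kwargs = {}
    if recommend is not None:
        kwargs["recommend_dictionary"] = recommend  # soft hints
    if coerce is not None:
        kwargs["coerce_dictionary"] = coerce        # stricter overrides
    segmented = ws(sentences, **kwargs)
    for original, words in zip(sentences, segmented):
        # A faithful segmenter only inserts boundaries, never edits text.
        if "".join(words) != original:
            raise ValueError(f"segmentation altered the text: {original!r}")
    return segmented
```

A check like this is cheap insurance when character offsets from segmentation feed into later steps such as entity spans.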

Key highlights

  • F1 of 97.33% on ASBC 4.0 word segmentation, beating the classic CKIPWS (95.91%) and Jieba-zh_TW (89.80%)
  • POS accuracy of 94.59% on the same corpus
  • GPU support via TensorFlow/CUDA; CPU works out of the box
  • User-defined dictionaries with per-word weights for segmentation hints
  • Published model architecture: BiLSTM with attention, from an AAAI 2020 paper
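For readers unfamiliar with the segmentation metric quoted above, here is a sketch of the standard word-level F1: a predicted word counts as correct only when both of its boundaries match the gold segmentation. This is the textbook definition computed for a single sentence; whether the project's reported figures were aggregated exactly this way is an assumption, and the function names are hypothetical.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold, predicted):
    """Return word-level (precision, recall, F1) for one sentence."""
    gold_spans, pred_spans = word_spans(gold), word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Only "政策" has matching boundaries here, so one of three words is correct
# and precision, recall, and F1 all come out to 1/3.
scores = segmentation_f1(["土地公", "有", "政策"], ["土地", "公有", "政策"])
```

The example shows why a single misplaced boundary is costly: it invalidates the words on both sides of it.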

Caveats

  • Requires ~2GB model download from Google Drive or an IIS mirror; no clear versioning of model files
  • Backend is tf-keras on TensorFlow, so you’re inheriting that dependency stack
  • GPL-3.0 license, which may complicate commercial use if you distribute derivatives

Verdict

Worth a look if you need accurate Traditional Chinese NLP and care about text fidelity; the no-mutation guarantee matters for downstream tasks. Skip it if you want something lightweight or permissively licensed: this is a research-grade tool with research-grade baggage.
