jiaeyan/Jiayan

NLP for the Analects: a toolkit that knows 之乎者也

Modern Chinese NLP tools butcher classical texts; Jiayan was built to parse them properly.

673 stars · Python · ML Frameworks, Data Tooling · 7-day velocity: +0.3 ★/day (trend: steady)

What it does

Jiayan is a Python toolkit for processing Classical Chinese (文言文). It handles lexicon construction, tokenization, POS tagging, sentence segmentation, and automatic punctuation — all trained or tuned for ancient rather than modern Chinese. You install it via pip, download pre-trained models from Baidu Netdisk, and feed it strings of unbroken classical text.

The interesting bit

The core trick is unsupervised lexicon building: it mines candidate words from raw text using a double trie, pointwise mutual information, and left/right entropy, so no annotated corpus is required. Tokenization then runs on either an N-gram HMM or a word DAG solved with dynamic programming, while sentence segmentation and punctuation are handled by stacked CRFs. The README includes a satisfying side-by-side where LTP and HanLP mangle a Zhuangzi passage while Jiayan keeps 内圣外王 intact.
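To make the lexicon-mining idea concrete, here is a minimal sketch of scoring candidate words by pointwise mutual information and left/right neighbor entropy. It is not Jiayan's code: it uses plain counters where the project uses tries, and the function name `score_candidates` and the `max_len` parameter are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def score_candidates(text, max_len=2):
    """Return {candidate: (pmi, left_entropy, right_entropy)} for n-grams of 2..max_len chars."""
    n = len(text)
    # Count every substring of length 1..max_len.
    counts = Counter(
        text[i:i + k] for k in range(1, max_len + 1) for i in range(n - k + 1)
    )
    left = defaultdict(Counter)   # candidate -> characters seen immediately before it
    right = defaultdict(Counter)  # candidate -> characters seen immediately after it
    for k in range(2, max_len + 1):
        for i in range(n - k + 1):
            gram = text[i:i + k]
            if i > 0:
                left[gram][text[i - 1]] += 1
            if i + k < n:
                right[gram][text[i + k]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log(c / total) for c in counter.values())

    def prob(gram):
        # Relative frequency among all substrings of the same length.
        return counts[gram] / (n - len(gram) + 1)

    scores = {}
    for gram in counts:
        if len(gram) < 2:
            continue
        # PMI at the weakest binary split: high values mean the parts stick together.
        pmi = min(
            math.log(prob(gram) / (prob(gram[:j]) * prob(gram[j:])))
            for j in range(1, len(gram))
        )
        scores[gram] = (pmi, entropy(left[gram]), entropy(right[gram]))
    return scores


if __name__ == "__main__":
    sample = "学而时习之不亦说乎有朋自远方来不亦乐乎人不知而不愠不亦君子乎"
    ranked = sorted(score_candidates(sample).items(), key=lambda kv: -kv[1][0])
    for gram, (pmi, left_h, right_h) in ranked[:5]:
        print(gram, round(pmi, 2), round(left_h, 2), round(right_h, 2))
```

A candidate is a plausible word when its PMI is high (its characters co-occur more than chance predicts) and both entropies are high (it appears in varied contexts). The thresholds for "high" would have to be tuned on a real corpus.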

Key highlights

  • Unsupervised lexicon construction from raw classical text (PMI + entropy on double tries)
  • Two tokenization modes: character-level HMM with KenLM, or dictionary-based max-probability path
  • Stacked CRFs for sentence segmentation and punctuation restoration
  • POS tagging with a dedicated tag set for classical grammar
  • Pre-trained models and a Zhuangzi sample file distributed via Baidu Netdisk (extract code: p0sc)
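The dictionary-based mode can be pictured as a shortest-path problem over a word DAG. The sketch below is a simplified stand-in, not Jiayan's tokenizer: it scores paths with unigram frequencies instead of a KenLM N-gram model, and the `segment` function, the toy lexicon, and the unknown-character penalty are all assumptions for illustration.

```python
import math


def segment(text, word_freq, max_word_len=4):
    """Return the segmentation of `text` with the highest unigram log-probability."""
    total = sum(word_freq.values())
    # Hypothetical penalty so that out-of-lexicon single characters stay reachable.
    unknown = math.log(1 / (total * 10))

    def logp(word):
        if word in word_freq:
            return math.log(word_freq[word] / total)
        return unknown if len(word) == 1 else None  # no edge for unknown multi-char spans

    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: score of the best path covering text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word on that path
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            lp = logp(text[start:end])
            if lp is not None and best[start] + lp > best[end]:
                best[end] = best[start] + lp
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


if __name__ == "__main__":
    toy_lexicon = {"内圣外王": 5, "之": 50, "道": 10, "内": 2, "圣": 2, "外": 2, "王": 3}
    print(segment("内圣外王之道", toy_lexicon))  # ['内圣外王', '之', '道']
```

Because one lexicon entry covers 内圣外王, the single-word path outscores the four-character split, which is the behavior the README comparison highlights.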

Caveats

  • Simplified Chinese only; Traditional Chinese (繁体) input must be converted to Simplified with OpenCC first, and the output converted back afterward
  • Models live on Baidu Netdisk, which is a friction point for non-China users
  • Classical-to-modern translation is listed as “in development” with no timeline
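The Simplified-only caveat implies a convert, process, convert-back round trip. A minimal sketch of that wrapper follows. The converters are passed in as plain callables so the example does not depend on any particular OpenCC binding, and the function name `process_traditional` is hypothetical. In practice the two callables would likely be the `convert` methods of OpenCC converter objects, but the exact configuration names vary between Python packages, so check the one you install.

```python
from typing import Callable, Iterable, List


def process_traditional(
    text: str,
    to_simplified: Callable[[str], str],
    to_traditional: Callable[[str], str],
    process: Callable[[str], Iterable[str]],
) -> List[str]:
    """Run a Simplified-only `process` step on Traditional input and map the pieces back."""
    simplified = to_simplified(text)
    return [to_traditional(piece) for piece in process(simplified)]


if __name__ == "__main__":
    # Toy two-character mapping standing in for a real converter (assumption for the demo).
    t2s = str.maketrans("內聖", "内圣")
    s2t = str.maketrans("内圣", "內聖")
    pieces = process_traditional(
        "內聖外王",
        lambda s: s.translate(t2s),
        lambda s: s.translate(s2t),
        list,  # stand-in for a tokenizer: splits into single characters
    )
    print(pieces)
```

Simplified-to-Traditional conversion is not always one-to-one, so the round trip may not reproduce the original characters exactly; comparing the rejoined output against the input is a cheap sanity check.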

Verdict

Digital humanists, philology grad students, and anyone building classical Chinese search or analysis pipelines should grab this. If your text is modern Chinese, or you need a polished SaaS API, this is the wrong tool.
