A Chinese lexicon that knows C++ from 冬虫夏草
Tsinghua's open word list gives domain-specific Chinese tokens with document-frequency data, no training required.

What it does
THUOCL is a collection of curated Chinese word lists across eleven domains: IT, finance, law, medicine, food, cars, animals, places, historical figures, poems, and idioms. Each entry comes with a DF (Document Frequency) value drawn from real corpora like Sina News, CSDN blogs, or Sogou data. The format is dead simple: one word, one tab, one number.
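Because the format is just word, tab, number, reading a list takes a few lines. Here is a minimal sketch in Python. The function name `parse_thuocl` and the DF values in the sample are made up for illustration, and splitting on the last run of whitespace (rather than strictly on one tab) is an assumption meant to tolerate stray spaces.

```python
from typing import Iterable, Iterator, Tuple


def parse_thuocl(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, df) pairs from 'word<TAB>df' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the LAST run of whitespace so a word that itself
        # contains a space keeps its full text.
        parts = line.rsplit(None, 1)
        if len(parts) != 2 or not parts[1].isdecimal():
            continue  # skip malformed rows rather than guess
        yield parts[0], int(parts[1])


# Sample rows; the DF numbers are invented, not taken from the real lists.
sample = ["C++编程\t120", "冬虫夏草\t4500", "", "broken line"]
entries = list(parse_thuocl(sample))
```

For a downloaded list, the same function works on an open file handle, e.g. `parse_thuocl(open(path, encoding="utf-8"))`, where `path` is wherever you saved the file.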
The interesting bit
The value is in the curation, not the algorithm. These are human-filtered domain terms (“C++编程” and “强连通缩点” in IT, “冬虫夏草” in medicine), ready to drop into a segmenter as a custom dictionary. The project explicitly pairs with THULAC, Tsinghua’s segmentation toolkit, for domain-tuned Chinese NLP.
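Dropping a list into a segmenter usually means reshaping it into whatever dictionary format that segmenter expects. The sketch below assumes a segmenter that wants a UTF-8 file with one word per line and no frequency column. `write_user_dict` is a hypothetical helper and the sample DF values are invented. The commented-out THULAC call at the end is an assumption about the `thulac` Python package, not something this write-up has verified.

```python
import os
import tempfile


def write_user_dict(thuocl_lines, out_path):
    """Write one word per line, dropping the DF column; return the word count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for line in thuocl_lines:
            parts = line.strip().rsplit(None, 1)
            # Keep only well-formed 'word<whitespace>number' rows.
            if len(parts) == 2 and parts[1].isdecimal():
                out.write(parts[0] + "\n")
                count += 1
    return count


# Sample rows; the DF numbers are invented for illustration.
sample = ["C++编程\t120", "强连通缩点\t35", "冬虫夏草\t4500"]
out_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
written = write_user_dict(sample, out_path)

# Hypothetical next step, ASSUMING the `thulac` package is installed and
# accepts a user dictionary path this way:
#   import thulac
#   segmenter = thulac.thulac(user_dict=out_path)
```

Other segmenters may want the frequency kept or a different column order, so check the target tool's dictionary format before converting.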
Key highlights
- 11 category-specific word lists, from 1,752 entries (cars) to 44,805 (places)
- Each entry includes a DF score for frequency-aware filtering
- Sourced from social tags, search hot words, and IME dictionaries, then manually vetted across multiple rounds
- Free for both research and commercial use
- Last substantive updates in 2016–2017
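Frequency-aware filtering with the DF score can be as simple as a threshold plus a sort. This is a rough sketch: `filter_by_df`, its parameters, and the sample entries are all made up for illustration, and a sensible cutoff will depend on the list and the corpus behind it.

```python
def filter_by_df(entries, min_df=1, top_k=None):
    """Keep (word, df) pairs with df >= min_df, most frequent first.

    If top_k is given, return only the top_k most frequent survivors.
    """
    kept = [(word, df) for word, df in entries if df >= min_df]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


# Invented (word, DF) pairs standing in for one parsed list.
entries = [("变速箱", 300), ("保时捷", 900), ("冷门车型", 2)]

common = filter_by_df(entries, min_df=10)   # drop rare terms
top_one = filter_by_df(entries, top_k=1)    # keep only the most frequent
```

Thresholding trims rare or noisy terms from the large lists. On a small list like cars, it may be better to keep everything.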
Caveats
- Most lists haven’t been updated since 2016–2017; the “open update” promise appears stalled
- No code or API, just raw .txt files, so you’re doing the integration work
- Scale varies wildly by domain (cars is tiny, places is huge)
Verdict
Grab this if you’re building Chinese NLP in a specialized domain and need a vetted starter lexicon. Skip it if you want a living, actively maintained resource or anything beyond flat text files.