thunlp/THUOCL

A Chinese lexicon that knows C++ from 冬虫夏草

Tsinghua's open word list gives domain-specific Chinese tokens with document-frequency data, no training required.

1.1k stars Data Tooling

What it does

THUOCL is a collection of curated Chinese word lists across eleven domains: IT, finance, law, medicine, food, cars, animals, places, historical figures, poems, and idioms. Each entry comes with a DF (Document Frequency) value drawn from real corpora like Sina News, CSDN blogs, or Sogou data. The format is dead simple: one word, one tab, one number.
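For a sense of how little parsing that format needs, here is a minimal sketch in Python. The function name `parse_thuocl` and the DF numbers in the sample are made up for illustration; only the word, tab, number layout comes from the lists themselves.

```python
from typing import Dict, Iterable


def parse_thuocl(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'word<TAB>DF' lines into a {word: df} mapping.

    Hypothetical helper, not part of THUOCL (the repo ships no code).
    """
    lexicon: Dict[str, int] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last tab so the word itself is kept whole.
        word, sep, df = line.rpartition("\t")
        if not sep or not df.strip().isdecimal():
            continue  # skip malformed rows rather than guess at them
        lexicon[word.strip()] = int(df)
    return lexicon


# Illustrative rows; the DF values here are invented.
sample = ["C++编程\t120\n", "强连通缩点\t3\n", "\n", "bad line\n"]
print(parse_thuocl(sample))  # {'C++编程': 120, '强连通缩点': 3}
```

Skipping malformed rows is a design choice for a sketch like this; a stricter loader might raise instead so bad data does not go unnoticed.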

The interesting bit

The value is in the curation, not the algorithm. These are human-filtered domain terms (“C++编程” and “强连通缩点” in IT, “冬虫夏草” in medicine), ready to drop into a segmenter as a custom dictionary. The project explicitly pairs with THULAC, Tsinghua’s segmentation toolkit, for domain-tuned Chinese NLP.
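Many segmenters take a custom dictionary as a plain file of words, so one plausible first step is stripping the DF column. The sketch below assumes UTF-8 files and a one-word-per-line target format; `write_user_dict`, the file names, and the DF value are hypothetical, and the exact format your segmenter expects should be checked against its own documentation.

```python
import os
import tempfile


def write_user_dict(thuocl_path: str, out_path: str) -> int:
    """Copy the word column of a THUOCL-style file to out_path.

    Writes one word per line, drops duplicates, and returns the number
    of words written. Hypothetical helper; assumes UTF-8 input.
    """
    seen = set()
    with open(thuocl_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for raw in src:
            # Everything before the first tab is the word.
            word = raw.split("\t", 1)[0].strip()
            if word and word not in seen:
                seen.add(word)
                dst.write(word + "\n")
    return len(seen)


# Demo on a throwaway file; the name and the DF value are stand-ins.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "medical_sample.txt")
out_path = os.path.join(tmp_dir, "user_dict.txt")
with open(src_path, "w", encoding="utf-8") as f:
    f.write("冬虫夏草\t100\n冬虫夏草\t100\n\n")
print(write_user_dict(src_path, out_path))  # 1
```

The resulting file is what you would then hand to the segmenter's user-dictionary option, whatever form that takes in the tool you use.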

Key highlights

  • 11 category-specific word lists, from 1,752 entries (cars) to 44,805 (places)
  • Each entry includes DF score for frequency-aware filtering
  • Sourced from social tags, search hot words, and IME dictionaries, then manually vetted across multiple rounds
  • Free for both research and commercial use
  • Last substantive updates in 2016–2017
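As a rough illustration of the frequency-aware filtering mentioned above, the sketch below keeps only terms at or above a DF threshold and ranks the most frequent ones. The helper names, the threshold, and the toy DF values are all assumptions for the example, not anything THUOCL prescribes.

```python
from typing import Dict, List, Tuple


def filter_by_df(lexicon: Dict[str, int], min_df: int) -> Dict[str, int]:
    """Keep only the words whose document frequency is at least min_df."""
    return {word: df for word, df in lexicon.items() if df >= min_df}


def top_terms(lexicon: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Return the k highest-DF entries, breaking ties by word."""
    return sorted(lexicon.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy lexicon with invented DF values.
toy = {"冬虫夏草": 50, "C++编程": 120, "强连通缩点": 3}
print(filter_by_df(toy, min_df=10))  # {'冬虫夏草': 50, 'C++编程': 120}
print(top_terms(toy, k=2))           # [('C++编程', 120), ('冬虫夏草', 50)]
```

A sensible threshold depends on the list: DF values come from different corpora per domain, so a cutoff that works for one list may not transfer to another.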

Caveats

  • Most lists haven’t been updated since 2016–2017; the “open update” promise appears stalled
  • No code or API—just raw .txt files, so you’re doing the integration work
  • Scale varies wildly by domain (cars is tiny, places is huge)

Verdict

Grab this if you’re building Chinese NLP in a specialized domain and need a vetted starter lexicon. Skip it if you want a living, actively maintained resource or anything beyond flat text files.
