Is THUOCL open source?

Yes — thunlp/THUOCL is open source, released under the MIT license.

How popular is THUOCL?

thunlp/THUOCL has 1.1k stars on GitHub.

Where can I find THUOCL?

thunlp/THUOCL is on GitHub at https://github.com/thunlp/THUOCL.

thunlp/THUOCL

A Chinese lexicon that knows C++ from 冬虫夏草

Tsinghua's open word list gives domain-specific Chinese tokens with document-frequency data, no training required.

★1.1k stars Data Tooling

What it does

THUOCL is a collection of curated Chinese word lists across eleven domains: IT, finance, law, medicine, food, cars, animals, places, historical figures, poems, and idioms. Each entry carries a DF (document frequency) value drawn from real corpora such as Sina News, CSDN blogs, or Sogou data. The format is dead simple: one word, one tab, one number per line.
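Since the format is just word, tab, number, reading a list takes only a few lines. The following is a minimal sketch, not code from the repository: the function name `parse_thuocl` is hypothetical, and the DF values in the sample are invented for illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_thuocl(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Yield (word, df) pairs from lines shaped like 'word<TAB>number'."""
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the word itself is kept whole.
        word, sep, df_text = line.rpartition("\t")
        if not sep:
            continue  # no tab found: not a 'word<TAB>number' line
        try:
            df = int(df_text)
        except ValueError:
            continue  # the second field is not a number
        yield word.strip(), df


# Invented sample lines; real DF values come from the repository's files.
sample = ["冬虫夏草\t1200\n", "C++编程\t35\n", "\n", "malformed line\n"]
entries = list(parse_thuocl(sample))
```

To read an actual list, pass an open file handle instead of `sample`, for example `open(path, encoding="utf-8")`, where `path` points at whichever .txt file you downloaded. Skipping malformed lines silently is a design choice for this sketch; you may prefer to log or raise instead.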

The interesting bit

The value is in the curation, not the algorithm. These are human-filtered domain terms, such as “C++编程” and “强连通缩点” in IT and “冬虫夏草” in medicine, ready to drop into a segmenter as a custom dictionary. The project explicitly pairs with THULAC, Tsinghua’s segmentation toolkit, for domain-tuned Chinese NLP.

Key highlights

11 category-specific word lists, from 1,752 entries (cars) to 44,805 (places)
Each entry includes a DF score for frequency-aware filtering
Sourced from social tags, search hot words, and IME dictionaries, then manually vetted across multiple rounds
Free for both research and commercial use
Last substantive updates in 2016–2017
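The frequency-aware filtering mentioned in the highlights could look roughly like this. It is a sketch under assumptions: `filter_by_df` and the `min_df` threshold are hypothetical names, the entries are invented, and the right cutoff depends on the corpus behind each list.

```python
from typing import List, Tuple


def filter_by_df(entries: List[Tuple[str, int]], min_df: int) -> List[Tuple[str, int]]:
    """Keep entries whose DF is at least min_df, most frequent first."""
    kept = [(word, df) for word, df in entries if df >= min_df]
    return sorted(kept, key=lambda entry: entry[1], reverse=True)


# Invented (word, DF) pairs standing in for a parsed word list.
entries = [("冬虫夏草", 1200), ("罕见词", 3), ("感冒", 98000)]
common = filter_by_df(entries, min_df=100)
```

Here `common` keeps the two entries with a DF of 100 or more and drops the rare one. Because DF values come from different corpora per domain, a threshold tuned on one list may not transfer to another.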

Caveats

Most lists haven’t been updated since 2016–2017; the “open update” promise appears stalled
No code or API—just raw .txt files, so you’re doing the integration work
Scale varies wildly by domain (cars is tiny, places is huge)
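The integration work mostly amounts to merging the lists you need and writing them out in whatever shape your segmenter expects. The sketch below assumes a target format of one word per line, which is an assumption rather than something the repository specifies; check your segmenter's documentation for its actual custom-dictionary format. The helper names and sample data are hypothetical.

```python
import os
import tempfile
from typing import Dict, Iterable, Tuple


def merge_domains(*domains: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Merge several (word, df) lists, keeping the highest DF per word."""
    merged: Dict[str, int] = {}
    for entries in domains:
        for word, df in entries:
            merged[word] = max(df, merged.get(word, 0))
    return merged


def write_user_dict(words: Iterable[str], path: str) -> int:
    """Write one word per line (assumed format) and return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for word in words:
            out.write(word + "\n")
            count += 1
    return count


# Invented entries standing in for two parsed domain lists.
it_terms = [("C++编程", 35), ("苹果", 500)]
food_terms = [("苹果", 9000), ("冬虫夏草", 1200)]
merged = merge_domains(it_terms, food_terms)

# Write to a temporary directory so the sketch leaves no files behind in the project.
out_path = os.path.join(tempfile.mkdtemp(), "user_dict.txt")
written = write_user_dict(sorted(merged), out_path)
```

Keeping the highest DF for a word that appears in several domains is one possible policy; summing is not meaningful here because each list's DF comes from a different corpus.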

Verdict

Grab this if you’re building Chinese NLP in a specialized domain and need a vetted starter lexicon. Skip it if you want a living, actively maintained resource or anything beyond flat text files.
