Is Chinese-Word-Vectors open source?

Yes — Embedding/Chinese-Word-Vectors is open source, released under the Apache-2.0 license.

What language is Chinese-Word-Vectors written in?

Embedding/Chinese-Word-Vectors is primarily written in Python.

How popular is Chinese-Word-Vectors?

Embedding/Chinese-Word-Vectors has 12.2k stars on GitHub.

Where can I find Chinese-Word-Vectors?

Embedding/Chinese-Word-Vectors is on GitHub at https://github.com/Embedding/Chinese-Word-Vectors.

Embedding/Chinese-Word-Vectors

A buffet of Chinese word embeddings, no training required

A research project that ships over 100 pre-trained Chinese word vectors across multiple corpora and linguistic representations, plus a benchmark to keep them honest.

★ 12.2k stars · Python · Language Models · Data Tooling

What it does

This repository distributes more than 100 pre-trained Chinese word vectors trained on corpora as varied as Wikipedia, Weibo, the People’s Daily, and the Complete Library in Four Sections. The release covers two representations, dense 300-dimensional SGNS (skip-gram with negative sampling) embeddings and sparse PPMI (positive pointwise mutual information) vectors, each trained with several context configurations: word only, word plus n-gram, word plus character, and combinations of these. Think of it as a systematic tasting menu, where you pick the vector set that matches your domain instead of training your own.
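As a rough illustration of using one of these sets, the sketch below parses a file in the word2vec text layout (a header line with vocabulary size and dimension, then one word and its values per line) and compares two words by cosine similarity. That layout is an assumption here, not something stated in this summary, and the function names and the tiny three-dimensional sample are hypothetical stand-ins for a real 300-dimensional download.

```python
import io
import math


def load_text_vectors(handle):
    """Parse word2vec-style text: 'count dim' header, then 'word v1 v2 ...' rows.

    The file layout is an assumption; check it against the file you download.
    """
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy data standing in for a downloaded vector file.
SAMPLE = "3 3\n北京 0.9 0.1 0.0\n上海 0.8 0.2 0.1\n苹果 0.0 0.1 0.9\n"
toy_vectors = load_text_vectors(io.StringIO(SAMPLE))
sim_cities = cosine(toy_vectors["北京"], toy_vectors["上海"])
sim_fruit = cosine(toy_vectors["北京"], toy_vectors["苹果"])
```

For a real file, pass a handle from `open(path, encoding="utf-8")` instead of the in-memory sample. If the format assumption holds, gensim's `KeyedVectors.load_word2vec_format` should read the same files and is likely the more practical choice for full-size vocabularies.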

The interesting bit

The authors back the downloads with CA8, a Chinese analogical reasoning dataset, and an evaluation toolkit, so you can measure how well the embeddings actually behave. This level of methodological rigor is rare in a bulk release: every set shares the same hyperparameters, so comparisons across corpora and context features are close to apples-to-apples.
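To make the idea of analogical reasoning concrete, here is a minimal sketch of the common 3CosAdd scoring rule: for a question "a is to b as c is to ?", pick the word whose vector is closest to b - a + c. Whether the repository's toolkit scores CA8 exactly this way is not stated in this summary, so treat the function name, the rule, and the hand-made three-dimensional vectors as illustrative assumptions.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    unit = {word: normalize(vec) for word, vec in vectors.items()}
    target = normalize(
        [vb - va + vc for va, vb, vc in zip(unit[a], unit[b], unit[c])]
    )
    best_word, best_score = None, -2.0  # cosine is never below -1
    for word, vec in unit.items():
        if word in (a, b, c):
            continue  # the question words themselves are not valid answers
        score = sum(x * y for x, y in zip(vec, target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical toy vectors; axes loosely mean (royal, male, female).
toy = {
    "男人": [0.0, 1.0, 0.0],  # man
    "女人": [0.0, 0.0, 1.0],  # woman
    "国王": [1.0, 1.0, 0.0],  # king
    "王后": [1.0, 0.0, 1.0],  # queen
    "苹果": [0.1, 0.2, 0.1],  # apple, a distractor
}
answer = solve_analogy(toy, "男人", "女人", "国王")  # man : woman :: king : ?
```

An analogy benchmark then reduces to running such questions over the whole dataset and reporting the fraction answered correctly, which is what makes differently trained vector sets comparable.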

Key highlights

Dense SGNS (word2vec-style) and sparse PPMI vectors, each crossed with word, character, and n-gram context features.
Coverage spans modern sources (Baidu Encyclopedia, Zhihu QA, financial news, Sogou News) to classical literature.
Includes the CA8 benchmark and an evaluation toolkit for intrinsic quality testing.
Consistent training recipe across the board: window size 5, dynamic windows, sub-sampling at 1e-5, and 300 dimensions for the dense SGNS sets.
A few sets are mirrored on Google Drive, though the majority of links point to Baidu Netdisk.
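The recipe terms in the highlights can be made concrete with a small sketch. It assumes that "sub-sampling at 1e-5" refers to the frequent-word rule from the word2vec paper (keep a token with probability sqrt(t / f), capped at 1) and that "dynamic windows" means drawing the window width uniformly from 1 to the maximum for each center word. Both readings are assumptions, the repository's own training code is not shown here, and all names below are hypothetical.

```python
import math
import random

WINDOW_SIZE = 5      # maximum context window, from the recipe
SUBSAMPLE_T = 1e-5   # sub-sampling threshold, from the recipe


def keep_probability(relative_freq, threshold=SUBSAMPLE_T):
    """Chance of keeping a token under the assumed sqrt(t / f) rule."""
    return min(1.0, math.sqrt(threshold / relative_freq))


def context_pairs(tokens, rng, max_window=WINDOW_SIZE):
    """Yield (center, context) pairs with a window width drawn from 1..max_window."""
    for i, center in enumerate(tokens):
        width = rng.randint(1, max_window)  # re-sampled for every center word
        lo = max(0, i - width)
        hi = min(len(tokens), i + width + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


rng = random.Random(0)
sentence = ["我", "喜欢", "自然", "语言", "处理"]
pairs = list(context_pairs(sentence, rng))
p_common = keep_probability(1e-3)  # a word making up 0.1% of the corpus
p_rare = keep_probability(1e-6)    # a rare word is always kept
```

Under these assumptions a word that makes up 0.1% of the corpus is kept only about one time in ten, while nearby context words are sampled more often than distant ones because narrow windows are drawn as often as wide ones.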

Caveats

The README is essentially a large download matrix; it shows no training code and no detailed API documentation.
Most file hosting is on Baidu Netdisk, which can be slow or inaccessible outside mainland China.
The README does not say whether the underlying corpora are version-locked or whether the embeddings will be retrained.

Verdict

Ideal if you need off-the-shelf Chinese embeddings with documented provenance and a built-in benchmark. Less useful if you want a training framework or an actively maintained Python package—this is a curated research release, not a library.
