brightmart/nlp_chinese_corpus

Five bulk bins of Chinese text for model training

It packages five Chinese datasets, each holding more than a million records, into downloadable JSON so researchers can stop hunting across Baidu and start training.

★ 9.9k stars · Data Tooling

What it does

This repository is a curated data pantry, not a codebase. It hosts five large Chinese text collections (wiki2019zh, news2016zh, baike2018qa, webtext2019zh, and translation2019zh) totaling roughly 14 million records. Each is provided as compressed JSON with documented fields that vary by dataset, such as title, text, category, and answer. The goal is to give practitioners a single place to grab bulk Chinese text for pre-training, word embeddings, or supervised tasks without writing their own crawlers.
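As a rough illustration of what consuming one of these files might look like, here is a minimal Python sketch. It assumes the archive has already been unpacked and that the resulting file holds one JSON object per line; that layout is an assumption to check against the repository's schema notes, and `iter_records` is a hypothetical helper name, not something the repository ships.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a UTF-8 text file.

    Assumption: each line is a standalone JSON object (JSON Lines).
    Reading line by line avoids loading a multi-gigabyte file at once.
    """
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines between records
                yield json.loads(line)
```

A call such as `next(iter_records("news2016zh_train.json"))` would then return the first record as a dictionary; the filename here is only a placeholder for wherever the unpacked file lives locally.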

The interesting bit

The project is a time capsule from the pre-ChatGPT era, born out of the author’s frustration that searching Baidu and GitHub turned up only small, stale, or hard-to-process Chinese corpora. It also bundles useful metadata (news keywords, Q&A categories, and translation pairs), which makes the raw dumps usable for tasks such as title generation or category prediction with minimal extra work.
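To make the "minimal extra work" concrete, the sketch below turns a single record into a training pair for each of the two tasks mentioned. The field names `title`, `text`, and `category` come from the description above, but which dataset carries which field is an assumption here, and both function names are hypothetical.

```python
from typing import Optional, Tuple


def _clean(record: dict, field: str) -> str:
    """Return the field as a stripped string, or '' if missing or null."""
    return (record.get(field) or "").strip()


def category_example(record: dict) -> Optional[Tuple[str, str]]:
    """Build an (input, label) pair for category prediction.

    Assumption: the record has 'title' and 'category' fields.
    Returns None when either one is missing or empty.
    """
    title, category = _clean(record, "title"), _clean(record, "category")
    if not title or not category:
        return None
    return title, category


def title_example(record: dict) -> Optional[Tuple[str, str]]:
    """Build a (body, target_title) pair for title generation.

    Assumption: the record has 'text' and 'title' fields.
    """
    text, title = _clean(record, "text"), _clean(record, "title")
    if not text or not title:
        return None
    return text, title
```

Returning `None` for incomplete records lets a caller filter them out in one pass instead of failing partway through a multi-million-line file.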

Key highlights

wiki2019zh: ~1.04 million structured Chinese Wikipedia entries with titles and clean paragraph breaks.
news2016zh: 2.5 million news articles from 63,000 media sources, including titles, keywords, descriptions, and timestamps.
baike2018qa: 1.5 million filtered Q&A pairs across 492 categories, split into training and validation sets.
webtext2019zh: 4.1 million community Q&A pairs positioned as high-quality fuel for very large language models.
translation2019zh: 5.2 million Chinese-English sentence pairs for bilingual tasks.

Caveats

None of the text is newer than 2019: news articles span 2014–2016 and the Wikipedia dump was last updated in February 2019, so the content is not contemporary.
Test sets for news2016zh and baike2018qa are mentioned but explicitly withheld from download.
This is a collection of external datasets with download links and schema docs; do not expect scripts, maintenance, or a living data pipeline.

Verdict

Grab this if you need large, pre-cleaned foundational Chinese text for training or benchmarking and do not want to build a crawler. Skip it if you need current events or an actively maintained stream—this is a 2019-era archive, not a live firehose.

Frequently asked

What is brightmart/nlp_chinese_corpus? It packages five Chinese datasets, each holding more than a million records, into downloadable JSON so researchers can stop hunting across Baidu and start training.
Is nlp_chinese_corpus open source? Yes, brightmart/nlp_chinese_corpus is open source, released under the MIT license.
How popular is nlp_chinese_corpus? brightmart/nlp_chinese_corpus has 9.9k stars on GitHub.
Where can I find nlp_chinese_corpus? brightmart/nlp_chinese_corpus is on GitHub at https://github.com/brightmart/nlp_chinese_corpus.