brightmart/nlp_chinese_corpus
A collection of large-scale Chinese text corpora for NLP model training and benchmarking.
★ 9.9k stars · Data Tooling

Velocity (7d): +3.7 ★/day · Trend: steady
This repository hosts Chinese text datasets totaling millions of entries, covering Wikipedia entries, news articles, Q&A pairs, community discussions, and translation pairs. The corpora are designed to support pre-training of Chinese language models and the development of NLP systems. The repository also references a pre-trained model (ALBERT_Chinese) and links to the CLUE benchmark for Chinese language understanding evaluation.