
brightmart/nlp_chinese_corpus

A collection of large-scale Chinese text corpora for NLP model training and benchmarking.

9.9k stars · Data Tooling
Velocity (7d): +3.7 ★/day · Trend: steady

This repository hosts large-scale Chinese text datasets, with entries numbering in the millions, covering Wikipedia entries, news articles, Q&A pairs, community discussions, and translation pairs. The corpora are designed to support pre-training of Chinese language models and the development of NLP systems. The repository also references a pre-trained model (ALBERT_Chinese) and links to the CLUE benchmark for evaluating Chinese language understanding.
