
esbatmop/MNBVC

A 60+ TB Chinese text corpus for training language models, with a target of 253 TB of diverse Chinese internet content.

Velocity (7d): +3.4 ★ / day · Trend: steady

MNBVC is a large-scale Chinese text corpus collected from the internet. It covers news, essays, novels, books, magazines, papers, scripts, posts, wiki entries, ancient poetry, lyrics, product descriptions, jokes, chat records, and other forms of plain-text Chinese data. The dataset currently contains over 60 TB of data and is structured for ML training, with files in JSON, JSONL, and Parquet formats. The data has been deduplicated and given basic preprocessing: HTML/XML is converted to TXT, and CSV/TSV is converted to JSON.
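Since the corpus ships partly as JSONL, a file can be read one record at a time rather than loaded whole. The following is a minimal sketch of that pattern, not the project's own tooling: the helper name `iter_jsonl` is hypothetical, and the record schema is not described above, so the code makes no assumption about field names.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: str | Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file.

    Hypothetical helper for illustration; it streams the file so that
    large shards do not need to fit in memory.
    """
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report the offending line instead of failing silently.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
```

A caller would loop over `iter_jsonl("some_shard.jsonl")`, where the file name is a placeholder, and inspect the keys of the first record to learn the actual schema before relying on any particular field.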
