esbatmop/MNBVC
A 60+ TB Chinese text corpus for training language models, with a stated goal of reaching 253 TB of diverse Chinese internet content.

MNBVC is a large-scale Chinese text corpus collected from the internet. It covers news, essays, novels, books, magazines, papers, scripts, forum posts, wiki pages, ancient poetry, lyrics, product descriptions, jokes, chat records, and other forms of plain-text Chinese data. The dataset currently contains over 60 TB of data and is structured for ML training, with files provided as JSON, JSONL, and Parquet. The data is deduplicated, and basic preprocessing has been applied: HTML/XML sources are converted to TXT, and CSV/TSV sources are converted to JSON.
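Since part of the corpus is distributed as JSONL, a minimal sketch of streaming records from one such file may be useful. It uses only the Python standard library. The file name `sample.jsonl` and the `text` field are assumptions made for illustration; the actual file names and record schema vary by subset and should be checked against the data itself.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line of a JSONL file.

    Reading line by line keeps memory use flat, which matters for
    multi-gigabyte corpus shards.
    """
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical local file; replace with a real downloaded shard.
    sample = Path("sample.jsonl")
    if sample.exists():
        for record in iter_jsonl(sample):
            # "text" is an assumed field name, not a confirmed schema.
            print(record.get("text", "")[:80])
```

Streaming one line at a time avoids loading a whole shard into memory; the same loop can feed a tokenizer or a further filtering step.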