Is MNBVC open source?

Yes — esbatmop/MNBVC is open source, released under the MIT license.

How popular is MNBVC?

esbatmop/MNBVC has 4.2k stars on GitHub.

Where can I find MNBVC?

esbatmop/MNBVC is on GitHub at https://github.com/esbatmop/MNBVC.

esbatmop/MNBVC

The 253 TB Chinese corpus that begs you to ignore it

A veteran Chinese internet forum is scraping the entire Chinese web into a 253-terabyte text archive to rival ChatGPT’s training data, from ancient poetry to court rulings, while actively begging journalists to leave them alone.

★ 4.2k stars · Data Tooling · Language Models

What it does

MNBVC is a sprawling Chinese text corpus maintained by the MOP Liwu Community, a long-running Chinese internet forum. It has stockpiled roughly 60 TB of raw text scraped from across the Chinese web (news, novels, ancient poetry, chat logs, court judgments, and even "Martian text", a stylized form of internet slang writing) and aims to reach 253 TB. The repository itself serves as a coordination hub, linking dozens of satellite projects that handle encoding detection, deduplication, PDF parsing, and source-specific cleaning.
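
To make the encoding-detection and deduplication steps concrete, here is a minimal sketch in Python. It is not taken from the MNBVC satellite repositories: the function names, the two-encoding fallback list, and the choice of exact hashing over whitespace-normalized text are all assumptions made for illustration. Real tooling for a corpus this size would likely use statistical encoding detection and near-duplicate methods instead.

```python
import hashlib

# Encodings tried in order. This short list is an assumption for illustration;
# it is not the project's actual detection logic.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def decode_with_fallback(raw: bytes) -> str:
    """Return the first strict decode that succeeds, else a lossy UTF-8 decode."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep the text but mark undecodable bytes.
    return raw.decode("utf-8", errors="replace")


def dedupe_exact(texts):
    """Keep the first copy of each text, comparing whitespace-normalized content."""
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


if __name__ == "__main__":
    raw_files = [
        "中文语料".encode("gb18030"),  # legacy-encoded copy
        "中文语料".encode("utf-8"),    # same text, UTF-8 encoded
        b"plain ascii",
    ]
    decoded = [decode_with_fallback(raw) for raw in raw_files]
    print(dedupe_exact(decoded))
```

Trying UTF-8 first matters because GB18030 accepts most byte sequences, so putting it first would silently mis-decode UTF-8 input. Hashing only catches exact repeats after normalization; pages that differ by a timestamp or a navigation bar would pass through untouched.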

The interesting bit

The project’s defining quirk is its aggressive modesty. The maintainers explicitly beg the media not to report on them, fearing hype will kill the project. To minimize copyright exposure, they deliberately withhold indexes and classifications, distributing encrypted compressed packages that contain only rough HTML-to-text conversions and a screenshot of the source webpage.

Key highlights

60,732 GB (roughly 60.7 TB) collected so far toward a 253 TB goal, about 24% complete
Covers mainstream and fringe Chinese internet culture, from Foreign Ministry press transcripts and exam questions to scripts for murder-mystery role-playing games
Distributed via P2P (VerySync) and Baidu Netdisk; compressed packages are encrypted and each folder includes a PNG screenshot of the data source
Ecosystem of satellite repos handles encoding detection, deduplication, GitHub and Bitbucket code crawling, arXiv parsing, and format normalization
Data is only coarsely processed: desensitization consists of stripping any run of eight or more consecutive digits
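
The digit-stripping rule in the last highlight can be sketched with a single regular expression. This is an illustration under stated assumptions, not the project's code: the function name is hypothetical, and the source does not say whether matched runs are deleted outright or replaced with a placeholder, so deletion is assumed here.

```python
import re

# Matches any run of eight or more consecutive digits. In Python 3 str patterns,
# \d also matches full-width digits such as "１２３", which appear in Chinese text.
LONG_DIGIT_RUN = re.compile(r"\d{8,}")


def strip_long_digit_runs(text: str) -> str:
    """Delete digit runs of length >= 8 (assumed behavior; shorter runs are kept)."""
    return LONG_DIGIT_RUN.sub("", text)


if __name__ == "__main__":
    sample = "电话13800138000，邮编100000"
    print(strip_long_digit_runs(sample))
```

A rule like this removes most phone numbers and ID numbers but leaves shorter values such as postal codes, years, and prices intact. It also leaves names, addresses, and email handles untouched, which is why the source describes the desensitization as coarse.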

Caveats

No index or classification is provided, making targeted retrieval essentially impossible by design
The team explicitly states that it cannot audit the copyright status of any source material
Storage requirements are massive: the first P2P partition alone requires more than 10 TB of disk space

Verdict

Useful if you are training Chinese foundation models and can handle raw, uncurated data with unclear provenance. Avoid if you need a rights-cleared, neatly labeled, or easily searchable dataset.
