3GB of WeChat articles, stripped bare for NLP researchers
A scraped, cleaned corpus of WeChat public account articles in JSON format, released for research use.

What it does
This repository distributes roughly 3GB of text scraped from WeChat public accounts (微信公众号). The HTML has been stripped out, leaving one JSON object per line with fields for account name, ID, article title, and body text. Data arrives in password-free split zip archives; a preview.json gives you the shape before downloading the full set.
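As a rough illustration of working with the one-object-per-line layout described above, here is a minimal Python sketch. The field names `name`, `account`, `title`, and `content` come from the repository's description. The sample record, the helper name `iter_articles`, and the choice to skip malformed lines are assumptions made for illustration, not something the repository documents.

```python
import io
import json

# Field names as listed in the repository description.
EXPECTED_FIELDS = ("name", "account", "title", "content")


def iter_articles(lines):
    """Yield one dict per non-empty line; skip lines that are not JSON objects."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Assumption: a scraped dump may contain a few bad lines,
            # so skip them rather than abort the whole pass.
            continue
        if isinstance(record, dict):
            yield record


# Hypothetical sample data standing in for the real corpus.
sample = io.StringIO(
    '{"name": "Example Account", "account": "example_id", '
    '"title": "Hello", "content": "Body text"}\n'
    "\n"
    "not json\n"
)
articles = list(iter_articles(sample))
```

Against a real extracted file, the same generator can be fed an open file handle (opened with `encoding="utf-8"`) instead of the in-memory sample, which keeps memory use flat across the full 3GB. Whether `preview.json` is itself line-delimited is not stated, so check its shape before relying on this reader for it.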
The interesting bit
The cleaning is the product. Raw WeChat HTML is noisy—ads, formatting, scripts—so having pre-extracted plain text saves researchers a tedious preprocessing step. The maintainer also commits to periodic updates, which matters for a platform where content churns fast.
Key highlights
- ~3GB of Chinese text, one article per line in JSONL-ish format
- Fields: name, account, title, content
- Split zip archives, no password hassle
- preview.json for quick inspection
- Explicitly research-use only (no commercial license stated)
- Maintainer open to requests via GitHub Issues
Caveats
- No documentation on scraping methodology, date ranges, or account selection criteria—so representativeness is unclear
- Periodic updates (“定期更新”) are promised, but no schedule or changelog is visible
- No license file; the “research only” request is a wish, not a legal framework
Verdict
Grab it if you need bulk Chinese text for NLP experiments and can live with opaque provenance. Skip if your project requires strict reproducibility, legal clarity, or up-to-the-month freshness.