nonamestreet/weixin_public_corpus

3GB of WeChat articles, stripped bare for NLP researchers

A scraped, cleaned corpus of WeChat public account articles in JSON format, released for research use.

594 stars · Data Tooling · 7-day velocity: +0.2 ★/day · Trend: steady

What it does

This repository distributes roughly 3GB of text scraped from WeChat public accounts (微信公众号). The HTML has been stripped out, leaving one JSON object per line with fields for account name, account ID, article title, and body text. The data ships as split zip archives with no password; a preview.json shows the record shape before you download the full set.
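The one-object-per-line layout described above can be read as a stream, without loading 3GB into memory. The following is a minimal sketch, not the maintainer's tooling: it assumes the four field names from this page's listing (name, account, title, content) and UTF-8 text, and the sample record is invented for illustration.

```python
import json
from typing import Iterable, Iterator

# Field names as listed on this page; the real dump may differ (assumption).
FIELDS = ("name", "account", "title", "content")


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per parseable line, skipping blanks and bad lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray malformed lines in a scraped dump
        if isinstance(record, dict):
            # Missing fields default to an empty string.
            yield {field: record.get(field, "") for field in FIELDS}


# Hypothetical sample lines standing in for the real archive contents.
sample = [
    '{"name": "Example Account", "account": "example_id", '
    '"title": "Hello", "content": "Body text"}',
    "",
    "not json",
]
articles = list(iter_articles(sample))
print(articles[0]["title"])
```

Because the function accepts any iterable of strings, an open file handle (for example one opened with encoding="utf-8" on an extracted file) can be passed in directly and consumed line by line.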

The interesting bit

The cleaning is the product. Raw WeChat HTML is noisy (ads, formatting, scripts), so pre-extracted plain text saves researchers a tedious preprocessing step. The maintainer also promises periodic updates, which matters for a platform where content churns fast.

Key highlights

  • ~3GB of Chinese text, one article per line in JSONL-ish format
  • Fields: name, account, title, content
  • Split zip archives, no password hassle
  • preview.json for quick inspection
  • Requested for research use only (no license file is provided)
  • Maintainer open to requests via GitHub Issues

Caveats

  • No documentation on scraping methodology, date ranges, or account selection criteria, so representativeness is unclear
  • "Periodic updates" (“定期更新”) are promised, but no schedule or changelog is visible
  • No license file; the “research only” request is a wish, not a legal framework
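Since the account selection is undocumented, one rough check you can run yourself is how concentrated the corpus is in a few accounts. The sketch below is an illustration under assumptions: it presumes the account field holds the account ID as listed on this page, the helper names are hypothetical, and the sample lines are made up.

```python
import json
from collections import Counter
from typing import Iterable


def account_counts(lines: Iterable[str]) -> Counter:
    """Count articles per account ID, skipping blank or malformed lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            # "account" is the field name listed on this page (assumption).
            counts[record.get("account", "<missing>")] += 1
    return counts


def top_share(counts: Counter, k: int) -> float:
    """Fraction of all articles contributed by the k largest accounts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _, n in counts.most_common(k)) / total


# Invented sample: three articles from one account, one from another.
sample = [
    '{"account": "a", "title": "t1"}',
    '{"account": "a", "title": "t2"}',
    '{"account": "a", "title": "t3"}',
    '{"account": "b", "title": "t4"}',
]
counts = account_counts(sample)
print(top_share(counts, 1))
```

A high share for a small k would suggest the text is dominated by a handful of accounts. It says nothing about date coverage or topic balance, which this check cannot recover.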

Verdict

Grab it if you need bulk Chinese text for NLP experiments and can live with opaque provenance. Skip if your project requires strict reproducibility, legal clarity, or up-to-the-month freshness.
