Is weixin_public_corpus open source?

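Partly. nonamestreet/weixin_public_corpus is publicly available on GitHub and tracked on heatdrop, but the repository has no license file, so it is not open source in the strict licensing sense. The maintainer asks that the data be used for research only.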

How popular is weixin_public_corpus?

nonamestreet/weixin_public_corpus has 593 stars on GitHub.

Where can I find weixin_public_corpus?

nonamestreet/weixin_public_corpus is on GitHub at https://github.com/nonamestreet/weixin_public_corpus.

nonamestreet/weixin_public_corpus

3GB of WeChat articles, stripped bare for NLP researchers

A scraped, cleaned corpus of WeChat public account articles in JSON format, released for research use.

★593 stars Data Tooling

What it does

This repository distributes roughly 3GB of text scraped from WeChat public accounts (微信公众号). The HTML has been stripped out, leaving one JSON object per line with fields for account name, ID, article title, and body text. Data arrives in password-free split zip archives; a preview.json gives you the shape before downloading the full set.
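A minimal sketch of reading records in the layout described above, assuming each line holds one JSON object with the keys name, account, title, and content. The sample line is made up for illustration, and the exact layout of preview.json has not been verified here.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per line that parses as a JSON object; skip everything else."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate an occasional damaged line in a scraped dump
        if isinstance(record, dict):
            yield record


# Hypothetical sample data; the key names are assumed from the field list.
sample = [
    '{"name": "Example Account", "account": "example_id", "title": "Hello", "content": "Body text"}',
    "",
    "not json",
]
articles = list(iter_articles(sample))
print(articles[0]["title"])
```

For a real file, pass an open file handle instead of the sample list, for example `iter_articles(open(path, encoding="utf-8"))`, where `path` points at an extracted file. Streaming line by line avoids loading the full ~3GB into memory.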

The interesting bit

The cleaning is the product. Raw WeChat HTML is noisy (ads, formatting, scripts), so pre-extracted plain text saves researchers a tedious preprocessing step. The maintainer also commits to periodic updates, which matters on a platform where content churns fast.

Key highlights

~3GB of Chinese text, one article per line as a JSON object (JSON Lines style)
Fields: name, account, title, content
Split zip archives with no password
preview.json for quick inspection of the record shape
Research use only, by the maintainer's request (no license of any kind is stated)
Maintainer open to requests via GitHub Issues

Caveats

No documentation on scraping methodology, date ranges, or account selection criteria, so representativeness is unclear
"Periodic updates" (定期更新) are promised, but no schedule or changelog is visible
No license file; the "research only" request is a wish, not a legal framework
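Since the account selection is undocumented, one rough check is to count articles per account and see how concentrated the corpus is. The sketch below assumes the same one-object-per-line layout and an account key; the sample lines are invented, and the function name is hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def account_profile(lines: Iterable[str]) -> Counter:
    """Count articles per account, ignoring blank or unparseable lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            # "account" is an assumed key name taken from the field list.
            counts[record.get("account", "<missing>")] += 1
    return counts


# Hypothetical sample data for illustration only.
sample = [
    '{"account": "a", "title": "t1", "content": "x"}',
    '{"account": "a", "title": "t2", "content": "y"}',
    '{"account": "b", "title": "t3", "content": "z"}',
]
profile = account_profile(sample)
total = sum(profile.values())
top_account, top_count = profile.most_common(1)[0]
print(f"{len(profile)} accounts, {total} articles, top account share {top_count / total:.0%}")
```

A corpus where a handful of accounts supply most of the articles will skew vocabulary and topic statistics, which is worth knowing before treating it as a general sample of WeChat writing.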

Verdict

Grab it if you need bulk Chinese text for NLP experiments and can live with opaque provenance. Skip if your project requires strict reproducibility, legal clarity, or up-to-the-month freshness.
