piskvorky/gensim-data

A CDN for NLP datasets that won't vanish mid-experiment

Gensim-data pins popular corpora and embeddings to immutable GitHub releases so your word2vec tutorial still works in 2027.

1.1k stars · Python · Data Tooling · Language Models
Velocity (7d): +0.3 ★/day · Trend: steady

What it does

This is the storage backend for Gensim’s downloader API. It hosts pretrained models (word2vec, GloVe, fastText, ConceptNet) and text corpora (Wikipedia dumps, 20-newsgroups, patent text) as immutable GitHub release attachments. You don’t clone this repo; you import gensim.downloader as api, call api.load("glove-twitter-25"), and the data lands in ~/gensim-data.
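A minimal sketch of that workflow in Python, assuming gensim is installed and the machine has network access on the first call. The helper names load_vectors and nearest_words are hypothetical wrappers for illustration; only gensim.downloader.load and the most_similar method on the returned vectors come from Gensim itself.

```python
def load_vectors(name="glove-twitter-25"):
    """Download (first call only) and load a pretrained model by name.

    gensim is imported lazily so this module can be imported even on a
    machine where gensim is not installed yet.
    """
    import gensim.downloader as api

    # The first call fetches the release attachment and caches it under
    # ~/gensim-data; later calls read from that cache.
    return api.load(name)


def nearest_words(vectors, word, topn=3):
    """Return only the neighbour words, dropping the similarity scores.

    `vectors` is anything with a gensim-style most_similar() method that
    yields (word, score) pairs.
    """
    return [neighbour for neighbour, _score in vectors.most_similar(word, topn=topn)]
```

Calling nearest_words(load_vectors(), "coffee") would trigger the download once and then query the cached vectors; the actual neighbours depend on the model, so none are shown here.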

The interesting bit

The README’s stated enemy is “research datasets regularly disappear, change over time, become obsolete.” Their fix is bureaucratic: each dataset version gets its own permanent release with license, provenance, and a usage example. It’s less a technical breakthrough than a promise that your text8 corpus won’t 404 because someone’s grad-school hosting expired.

Key highlights

  • 8 datasets including full 2017 Wikipedia (6.2 GB), USPTO patents (2.9 GB), and a fake-news corpus with refreshingly honest metadata
  • Pretrained embeddings from multiple families: GloVe Twitter variants (25d–200d), fastText wiki-news, ConceptNet Numberbatch
  • CLI and Python API: python -m gensim.downloader --info prints dataset metadata; api.load() loads a dataset by name, or with return_path=True returns the path to the downloaded file instead
  • Each release is frozen; new versions get new release tags rather than overwriting
  • Explicit license field per dataset (several are marked “not found” or “probably”; the honesty is noted)
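The Python side of those entry points can be sketched as follows. This is an illustration under stated assumptions, not the project's own code: fetch_catalog, local_path, and list_entries are hypothetical helper names, and the catalog layout (top-level "corpora" and "models" keys, a "license" field per entry) is assumed from what api.info() returns at the time of writing. Missing fields fall back to "unknown" rather than failing.

```python
def fetch_catalog():
    """Return the full dataset/model catalog (needs gensim and network access)."""
    import gensim.downloader as api

    return api.info()


def local_path(name):
    """Download if needed, then return the file path instead of a loaded object."""
    import gensim.downloader as api

    return api.load(name, return_path=True)


def list_entries(catalog):
    """Flatten a catalog into sorted (kind, name, license) rows.

    Assumes the catalog has "corpora" and "models" sections whose entries
    may carry a "license" field; anything missing is reported as "unknown".
    """
    rows = []
    for kind in ("corpora", "models"):
        for name, meta in sorted(catalog.get(kind, {}).items()):
            rows.append((kind, name, meta.get("license", "unknown")))
    return rows
```

Running list_entries(fetch_catalog()) gives a quick way to audit license fields before picking a dataset, which is the check the caveats below suggest doing for anything commercial.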

Caveats

  • Several datasets lack clear license attribution (20-newsgroups, patent-2017, text8 all “not found”)
  • Most embeddings are circa 2016–2017; no LLM-era models here
  • The 2017 Wikipedia dump is aging; no evidence of scheduled refresh cycles in the README

Verdict

Use this if you’re teaching, reproducing older papers, or need battle-tested word vectors without hunting down dead links. Skip it if you need modern contextual embeddings or legally pristine data for commercial products.
