Is gensim-data open source?

Yes — piskvorky/gensim-data is open source, released under the LGPL-2.1 license.

What language is gensim-data written in?

piskvorky/gensim-data is primarily written in Python.

How popular is gensim-data?

piskvorky/gensim-data has 1.1k stars on GitHub.

Where can I find gensim-data?

piskvorky/gensim-data is on GitHub at https://github.com/piskvorky/gensim-data.


piskvorky/gensim-data

A CDN for NLP datasets that won't vanish mid-experiment

Gensim-data pins popular corpora and embeddings to immutable GitHub releases so your word2vec tutorial still works in 2027.

★1.1k stars Python Data Tooling Language Models


What it does

This is the storage backend for Gensim’s downloader API. It hosts pretrained models (word2vec, GloVe, fastText, ConceptNet) and text corpora (a Wikipedia dump, 20-newsgroups, patent text) as immutable GitHub release attachments. You don’t clone this repo; you call gensim.downloader.load("glove-twitter-25") (the module is conventionally imported as api, hence api.load) and the data lands in ~/gensim-data.
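A minimal sketch of that call path, assuming gensim is installed and the network is reachable on first use. The helper names expected_cache_dir and load_vectors are hypothetical, and the ~/gensim-data/<name> layout is an assumption based on the default cache location described above.

```python
from pathlib import Path


def expected_cache_dir(name: str) -> Path:
    """Return where the downloader is expected to cache `name`.

    Assumed layout: ~/gensim-data/<name>. This only builds the path;
    it does not check that anything has been downloaded.
    """
    return Path.home() / "gensim-data" / name


def load_vectors(name: str = "glove-twitter-25"):
    """Download `name` on first use, then load it from the local cache."""
    # Imported lazily so expected_cache_dir works without gensim installed.
    import gensim.downloader as api

    return api.load(name)


if __name__ == "__main__":
    vectors = load_vectors()  # triggers a download the first time
    print(expected_cache_dir("glove-twitter-25"))
    # For embedding datasets the loaded object supports similarity queries.
    print(vectors.most_similar("coffee", topn=3))
```

The first call is slow because it fetches the release attachment; later calls should read from the cache directory instead.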

The interesting bit

The README’s stated enemy is “research datasets regularly disappear, change over time, become obsolete.” Their fix is bureaucratic: each dataset version gets its own permanent release with license, provenance, and a usage example. It’s less a technical breakthrough than a promise that your text8 corpus won’t 404 because someone’s grad-school hosting expired.

Key highlights

8 datasets including full 2017 Wikipedia (6.2 GB), USPTO patents (2.9 GB), and a fake-news corpus with refreshingly honest metadata
Pretrained embeddings from multiple families: GloVe Twitter variants (25d–200d), fastText wiki-news, ConceptNet NumberBatch
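CLI and Python API: python -m gensim.downloader --info on the command line, or gensim.downloader.load() in code (conventionally api.load()), with optional return_path=True to get the path of the raw file instead of a loaded object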
Each release is frozen; new versions get new release tags rather than overwriting
Explicit license field per dataset; several are candidly marked “not found” or “probably”
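The metadata and raw-path options in the list above can be sketched as follows, assuming gensim is installed. The helpers license_summary and inspect_dataset are hypothetical, and the "license" key is an assumption about the shape of the metadata dictionary, which is why it is read defensively.

```python
def license_summary(name: str, info: dict) -> str:
    """Format the license field of one metadata entry, flagging missing values."""
    # "license" is assumed to be a key in the metadata; fall back if absent.
    license_text = str(info.get("license", "not found"))
    unclear = license_text.strip().lower() in {"", "not found"}
    flag = " (check before commercial use)" if unclear else ""
    return f"{name}: {license_text}{flag}"


def inspect_dataset(name: str = "text8") -> None:
    """Print the license summary and local file path for one dataset."""
    # Imported lazily so license_summary works without gensim installed.
    import gensim.downloader as api

    info = api.info(name)  # metadata for a single dataset
    print(license_summary(name, info))
    # return_path=True downloads if needed and returns the file path
    # rather than a loaded corpus or model object.
    print(api.load(name, return_path=True))


if __name__ == "__main__":
    inspect_dataset()
```

Checking the license field before loading is a cheap way to catch the “not found” cases noted under Caveats.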

Caveats

Several datasets lack clear license attribution (20-newsgroups, patent-2017, text8 all “not found”)
Most embeddings are circa 2016–2017; no LLM-era models here
The 2017 Wikipedia dump is aging; no evidence of scheduled refresh cycles in the README

Verdict

Use this if you’re teaching, reproducing older papers, or need battle-tested word vectors without hunting down dead links. Skip it if you need modern contextual embeddings or legally pristine data for commercial products.
