A toy GloVe implementation that knows it's buggy
A self-aware Cython port of Stanford's word embedding algorithm, complete with OpenMP headaches and paragraph vectors.

What it does
Turns a text corpus into dense word vectors by factorizing the log of a word co-occurrence matrix. The API is sklearn-esque: build a Corpus, train a Glove model, then query most_similar('physics') to watch it return ‘biology’ and ‘chemistry’. There’s also a transform_paragraph method for rudimentary paragraph vectors weighted roughly like tf-idf.
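For reference, "factorizing the log of a co-occurrence matrix" usually means minimizing the weighted least-squares objective from the original GloVe paper. The symbols below follow the paper, not this port's source; whether the Cython code matches it term for term (for example, whether it keeps separate context vectors) is an assumption.

```latex
J = \sum_{i,j:\, X_{ij} > 0} f(X_{ij})
    \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\qquad
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\
  1 & \text{otherwise}
\end{cases}
```

Here X_ij is the co-occurrence count of words i and j, w and w-tilde are the word and context vectors, and b and b-tilde are their biases. The paper uses x_max = 100 and alpha = 3/4.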
The interesting bit
The author labels it a “toy” implementation and cheerfully admits that “most likely, it contains a tremendous amount of bugs.” That honesty is refreshing in a space crowded with embedding libraries that overclaim. The Cython + asynchronous SGD core is real enough to train on Wikipedia dumps and produce sensible analogies, but the framing keeps expectations in check.
Key highlights
- Two-stage pipeline: co-occurrence matrix construction, then matrix factorization via SGD
- Cython implementation with OpenMP parallelism for training speed
- Rudimentary paragraph vectors via transform_paragraph
- Ships with example.py and a make all-wiki target for quick Wikipedia experiments
- Published to PyPI as glove_python
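To make the two-stage pipeline concrete, here is a minimal pure-Python sketch, not the library's code. The function names (build_cooccurrence, train, most_similar) and the toy corpus are hypothetical. It also simplifies deliberately: one shared set of vectors instead of separate word and context vectors, plain single-threaded SGD instead of the asynchronous Cython/OpenMP loop, and toy hyperparameters.

```python
import math
import random
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Stage 1: distance-weighted co-occurrence counts within a window."""
    vocab = {}
    counts = defaultdict(float)
    for sentence in sentences:
        ids = [vocab.setdefault(w, len(vocab)) for w in sentence]
        for pos, i in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                j = ids[pos + offset]
                if i == j:
                    continue
                # Store each unordered pair once; nearer words weigh more.
                counts[(min(i, j), max(i, j))] += 1.0 / offset
    return vocab, dict(counts)


def train(counts, vocab_size, dim=8, epochs=200, lr=0.05,
          x_max=10.0, alpha=0.75, seed=0):
    """Stage 2: fit v_i . v_j + b_i + b_j to log(count) with plain SGD."""
    rng = random.Random(seed)
    vecs = [[(rng.random() - 0.5) / dim for _ in range(dim)]
            for _ in range(vocab_size)]
    biases = [0.0] * vocab_size
    pairs = list(counts.items())
    losses = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for (i, j), x in pairs:
            weight = min(1.0, (x / x_max) ** alpha)  # GloVe weighting f(x)
            pred = sum(a * b for a, b in zip(vecs[i], vecs[j]))
            err = pred + biases[i] + biases[j] - math.log(x)
            total += 0.5 * weight * err * err
            step = lr * weight * err
            for k in range(dim):
                vi, vj = vecs[i][k], vecs[j][k]
                vecs[i][k] -= step * vj
                vecs[j][k] -= step * vi
            biases[i] -= step
            biases[j] -= step
        losses.append(total)  # loss accumulated during this epoch
    return vecs, biases, losses


def most_similar(word, vocab, vecs, number=3):
    """Rank the other words by cosine similarity to `word`."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    target = vecs[vocab[word]]
    scored = [(other, cosine(target, vecs[idx]))
              for other, idx in vocab.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:number]


sentences = [
    "physics studies matter and energy".split(),
    "chemistry studies matter and reactions".split(),
    "biology studies life and cells".split(),
]
vocab, counts = build_cooccurrence(sentences, window=2)
vecs, biases, losses = train(counts, len(vocab))
print("loss, first vs last epoch:", losses[0], losses[-1])
print(most_similar("physics", vocab, vecs))
```

A three-sentence corpus is far too small for the neighbours to mean anything; the point is only to show how the counting and factorization stages fit together.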
Caveats
- macOS compilation is broken under Clang; needs Homebrew/Anaconda gcc and Python
- The author explicitly warns of “a tremendous amount of bugs”
- Paragraph vectors are described as “rudimentary”
Verdict
Worth a look if you want to understand GloVe’s mechanics without the Stanford C boilerplate, or need a hackable embedding baseline. Skip it for production; use gensim’s word2vec or a mature GloVe binding instead.