maciejkula/glove-python

A toy GloVe implementation that knows it's buggy

A self-aware Cython port of Stanford's word embedding algorithm, complete with OpenMP headaches and paragraph vectors.

1.3k stars · Python · Language Models

What it does

Turns a text corpus into dense word vectors by factorizing the log of a word co-occurrence matrix. The API is scikit-learn-style: build a Corpus, train a Glove model, then query most_similar('physics') and watch it return 'biology' and 'chemistry'. There is also a transform_paragraph method for rudimentary paragraph vectors, weighted roughly like tf-idf.
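To make the factorization idea concrete, here is a minimal pure-Python sketch of the two stages: counting co-occurrences, then fitting vectors so that dot products approximate log counts. It is not the repository's Cython code. The function names (build_cooccurrence, train_glove), the 1/distance window weighting, the plain single-threaded SGD update and every hyperparameter default are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def build_cooccurrence(sentences, window=2):
    """Stage 1: symmetric co-occurrence counts, weighted by 1/distance.

    Hypothetical helper, not part of glove-python.
    """
    vocab = {}
    counts = defaultdict(float)
    for sentence in sentences:
        # Assign each new word the next free integer id.
        ids = [vocab.setdefault(word, len(vocab)) for word in sentence]
        for pos, i in enumerate(ids):
            for offset in range(1, window + 1):
                if pos + offset >= len(ids):
                    break
                j = ids[pos + offset]
                counts[(i, j)] += 1.0 / offset
                counts[(j, i)] += 1.0 / offset
    return vocab, dict(counts)


def train_glove(counts, vocab_size, dim=8, epochs=50, lr=0.05,
                x_max=10.0, alpha=0.75, seed=0):
    """Stage 2: fit w_i . c_j + b_i + b_j ~ log(x_ij) with plain SGD.

    Assumed simplification: no AdaGrad, no threads.
    """
    rng = random.Random(seed)

    def init():
        return [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
                for _ in range(vocab_size)]

    W, C = init(), init()          # word and context vectors
    bw = [0.0] * vocab_size        # word biases
    bc = [0.0] * vocab_size        # context biases
    pairs = list(counts.items())
    losses = []
    for _ in range(epochs):
        rng.shuffle(pairs)
        total = 0.0
        for (i, j), x in pairs:
            # GloVe weighting: damp rare pairs, cap frequent ones at 1.
            weight = min(1.0, (x / x_max) ** alpha)
            pred = sum(a * b for a, b in zip(W[i], C[j])) + bw[i] + bc[j]
            err = pred - math.log(x)
            total += 0.5 * weight * err * err
            step = lr * weight * err
            for k in range(dim):
                wi, cj = W[i][k], C[j][k]
                W[i][k] -= step * cj
                C[j][k] -= step * wi
            bw[i] -= step
            bc[j] -= step
        losses.append(total)
    return W, losses


sentences = [
    ["physics", "is", "a", "science"],
    ["biology", "is", "a", "science"],
    ["chemistry", "is", "a", "science"],
]
vocab, counts = build_cooccurrence(sentences, window=2)
vectors, losses = train_glove(counts, len(vocab))
```

A corpus this small cannot produce meaningful neighbours; the point is only that the per-epoch loss falls as the vectors learn to reproduce the log counts.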

The interesting bit

The author labels it a “toy” implementation and cheerfully admits that “most likely, it contains a tremendous amount of bugs.” That candor is refreshing in a space crowded with overclaimed embedding libraries. The Cython + asynchronous SGD core is real enough to train on Wikipedia dumps and produce sensible analogies, but the framing keeps expectations in check.

Key highlights

  • Two-stage pipeline: co-occurrence matrix construction, then matrix factorization via SGD
  • Cython implementation with OpenMP parallelism for training speed
  • Rudimentary paragraph vectors via transform_paragraph
  • Ships with example.py and a make all-wiki target for quick Wikipedia experiments
  • Published to PyPI as glove_python
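For a feel of what "weighted roughly like tf-idf" means for paragraph vectors, the sketch below averages word vectors with explicit tf-idf weights. This is a swapped-in technique for illustration, not how transform_paragraph is implemented. The helper names (idf_weights, paragraph_vector) and the smoothed idf formula are assumptions.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Smoothed inverse document frequency per word (assumed formula)."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1.0
            for word, df in doc_freq.items()}


def paragraph_vector(tokens, word_vectors, idf):
    """tf-idf weighted average of the known word vectors in `tokens`.

    Hypothetical helper; tokens missing from either table are skipped.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, tf in Counter(tokens).items():
        if word not in word_vectors or word not in idf:
            continue
        weight = tf * idf[word]
        weight_sum += weight
        for k, value in enumerate(word_vectors[word]):
            total[k] += weight * value
    if weight_sum == 0.0:
        return total  # no known tokens: zero vector
    return [value / weight_sum for value in total]


# Toy 2-d vectors: a very common word and a rarer one.
word_vectors = {"the": [1.0, 0.0], "physics": [0.0, 1.0]}
documents = [["the", "physics"], ["the", "cat"], ["the", "dog"]]
idf = idf_weights(documents)
vec = paragraph_vector(["the", "physics"], word_vectors, idf)
```

Because "the" appears in every document, it gets the minimum weight, so the resulting vector leans toward "physics". That is the effect the tf-idf comparison points at.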

Caveats

  • macOS compilation is broken under Clang; needs Homebrew/Anaconda gcc and Python
  • The author explicitly warns of “a tremendous amount of bugs”
  • Paragraph vectors are described as “rudimentary”

Verdict

Worth a look if you want to understand GloVe’s mechanics without the Stanford C boilerplate, or need a hackable embedding baseline. Skip it for production: use gensim’s word2vec or a mature GloVe binding instead.
