Hironsan/awesome-embedding-models

A reading list for people who miss word2vec

A curated index of embedding-model papers, tools, and pre-trained vectors from the era before LLMs ate everything.

1.8k stars · Jupyter Notebook · Learning · Language Models
Velocity (7d): +0.5 ★/day · Trend: steady

What it does This repo is a classic “awesome list” — a hand-maintained index of resources for word, sentence, and document embeddings. It catalogs foundational papers (word2vec, GloVe, FastText, BERT, ELMo), key researchers, courses, datasets, and pre-trained model links. Think of it as a bibliography with working hyperlinks.

The interesting bit The list is frozen at a specific moment, around 2018, when contextual embeddings like ELMo and BERT were arriving but before the transformer tsunami fully hit. It captures the transition from static word vectors to contextualized representations, with an entire section debating whether count-based or prediction-based methods win, a fight that now feels almost quaint.

Key highlights

  • Heavyweight paper coverage: Mikolov’s word2vec series, GloVe, FastText, plus the first BERT and ELMo papers
  • Pre-trained vector links for 157 languages via FastText, plus biomedical specials (BioWordVec, BioSentVec)
  • Curated researcher list (Mikolov, Bengio, Goldberg, Levy, Chen) with Google Scholar links
  • Evaluation datasets and papers questioning whether word-similarity tasks actually predict downstream performance
  • Implementation links for gensim, TensorFlow word2vec tutorials, and a GPU-optimized GloVe layer
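
For readers who never worked with static vectors, the word-similarity evaluations mentioned above mostly come down to cosine similarity between fixed per-word vectors. Here is a minimal sketch in plain Python. The three-dimensional `TOY_VECTORS` and the `most_similar` helper are invented for illustration and are not taken from the repo; real pre-trained vectors typically have hundreds of dimensions and would be loaded from a file.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real pre-trained vectors would be loaded from disk instead.
TOY_VECTORS = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to `word` by cosine."""
    query = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))


if __name__ == "__main__":
    print(most_similar("king", TOY_VECTORS))
```

Because each word maps to exactly one vector regardless of context, this lookup is what separates the static models from the contextual ones such as ELMo and BERT.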

Caveats

  • Last substantive update appears to be circa 2018; no modern sentence transformers, no OpenAI embeddings, no retrieval-augmented generation
  • The “Articles” section is commented out in the source, suggesting unfinished maintenance
  • Some TensorFlow links point to r0.12 documentation, which is archaeological at this point

Verdict Worth a bookmark if you’re doing historical NLP research, teaching an embeddings course, or need a quick reference to the pre-transformer canon. Skip it if you want practical guidance on modern vector search or API-based embedding services — this is a museum, not a manual.
