275 languages, one embedding model, zero training time
Pre-trained subword embeddings for the long tail of languages that mainstream NLP usually ignores.

What it does
BPEmb ships pre-trained subword embeddings and Byte-Pair Encoding segmenters for 275 languages, from Abkhazian to Zulu. You pick a language code, vocabulary size, and embedding dimension; the library downloads the matching SentencePiece model and gensim-compatible vectors automatically. It then segments text into subword units and returns numpy arrays ready for a neural network.
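The workflow above can be sketched in a few lines. This is a minimal sketch, assuming the bpemb package is installed and that its BPEmb constructor takes lang, vs, and dim keyword arguments and exposes encode and embed methods; the helper name segment_and_embed is made up for illustration.

```python
def segment_and_embed(text, lang="en", vs=10000, dim=50):
    """Segment `text` into subwords and look up one vector per subword.

    Hypothetical helper; assumes the bpemb package and its BPEmb class.
    """
    # Imported lazily so that defining this helper does not require bpemb.
    # The first call is expected to download the files for (lang, vs, dim).
    from bpemb import BPEmb

    model = BPEmb(lang=lang, vs=vs, dim=dim)
    pieces = model.encode(text)   # list of subword strings
    vectors = model.embed(text)   # numpy array, one row per subword
    return pieces, vectors
```

Calling segment_and_embed("Stratford") should hand back the subword pieces together with an array that has one dim-sized row per piece. Expect the first call to need network access.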
The interesting bit
The project treats vocabulary size as a tunable dial: crank it down to 1,000 and “Stratford” shatters into ['▁str', 'at', 'f', 'ord']; crank it up to 25,000 and the word stays whole. That granularity lets you trade model size against segmentation fineness without retraining anything. The coverage is also unusually broad: this is one of the few embedding sets that takes languages like Atikamekw and Goan Konkani as seriously as English and Mandarin.
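To see why a smaller vocabulary yields finer pieces, here is a toy Byte-Pair Encoding learner in pure Python. It is an illustration only, not BPEmb's SentencePiece pipeline: the tiny corpus, the function names, and the tie-breaking rule are all assumptions. The number of merges stands in for vocabulary size, since each merge adds one symbol.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as single characters, with '▁' marking the word start.
    words = Counter(tuple("▁" + w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:  # every word is already a single symbol
            break
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges


def segment(word, merges):
    """Segment `word` by replaying the learned merges in order."""
    symbols = tuple("▁" + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Made-up toy corpus, purely for illustration.
corpus = ["stratford"] * 3 + ["strand", "ford", "oxford", "street"]
merges = learn_merges(corpus, 100)
few = segment("stratford", merges[:3])    # small "vocabulary": many pieces
many = segment("stratford", merges)       # large "vocabulary": one piece
```

With only a few merges the word stays in many small pieces; with the full merge list it collapses into a single symbol, mirroring the dial described above.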
Key highlights
- 275 languages supported, all trained on Wikipedia corpora
- Embeddings exposed as gensim KeyedVectors for familiar .most_similar() and vector math
- SentencePiece models handle segmentation; no custom tokenizer needed
- Lazy download: models and vectors are fetched on first use, not at install time
- Multiple vocabulary sizes (1k–200k) and dimensions (25–300) per language
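For readers unfamiliar with .most_similar(), the sketch below shows the idea behind it: rank entries by cosine similarity to a query vector. It uses plain numpy and made-up two-dimensional vectors rather than real BPEmb embeddings, and the function here is a stand-in, not gensim's implementation.

```python
import numpy as np


def most_similar(query, vocab, vectors, topn=3):
    """Rank vocabulary entries by cosine similarity to `query`."""
    # Normalise rows so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[vocab.index(query)]
    # Sort by descending score and drop the query itself.
    order = [i for i in np.argsort(-scores) if vocab[i] != query]
    return [(vocab[i], float(scores[i])) for i in order[:topn]]


# Made-up subwords and vectors, for illustration only.
vocab = ["▁ford", "▁river", "▁bridge", "▁car"]
vectors = np.array([
    [1.0, 0.2],
    [0.9, 0.3],
    [0.5, 0.8],
    [-0.2, 1.0],
])

neighbours = most_similar("▁ford", vocab, vectors)
```

With real embeddings the same ranking is what you would expect from calling most_similar on the gensim KeyedVectors object the library exposes.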
Caveats
- Training data is Wikipedia-only, so domain shift is likely on social media, code, or specialized text
- The README notes a MultiBPEmb section but gives no details in the visible portion; multilingual joint training is unclear
- No explicit license mentioned in the provided sources
Verdict
Grab this if you need embeddings for a low-resource language fast and don’t have the compute or data to train your own. Skip it if you’re already inside a modern transformer stack: subword tokenization is built in, and these static vectors won’t beat contextual representations on high-resource languages.