bheinzerling/bpemb

275 languages, one embedding model, zero training time

Pre-trained subword embeddings for the long tail of languages that mainstream NLP usually ignores.

1.2k stars · Python · Language Models · Data Tooling
Velocity (7d): +0.4 ★/day · Trend: steady

What it does

BPEmb ships pre-trained subword embeddings and Byte-Pair Encoding segmenters for 275 languages, from Abkhazian to Zulu. You pick a language code, vocabulary size, and embedding dimension; the library automatically downloads the matching SentencePiece model and gensim-compatible vectors. It then segments text into subword units and returns NumPy arrays, one vector per subword, ready to feed into a neural network.
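
A minimal sketch of that workflow, assuming the package is installed with pip install bpemb. The helper name segment_and_embed and its default arguments are hypothetical choices for illustration; the BPEmb constructor, encode, and embed calls follow the project's documented usage. The first call for a given language, vocabulary size, and dimension triggers a download, so it needs network access.

```python
def segment_and_embed(text, lang="en", vs=10000, dim=50):
    """Segment `text` into subwords and look up one vector per subword.

    Hypothetical helper for illustration. `lang` is a language code,
    `vs` the vocabulary size, and `dim` the embedding dimension.
    """
    # Imported lazily so that defining this helper has no side effects;
    # the model files are only fetched when the function is called.
    from bpemb import BPEmb

    model = BPEmb(lang=lang, vs=vs, dim=dim)
    subwords = model.encode(text)  # list of subword strings
    vectors = model.embed(text)    # NumPy array with one row per subword
    return subwords, vectors


# Example call (downloads the English model on first use):
#   subwords, vectors = segment_and_embed("Stratford")
#   vectors.shape[0] equals len(subwords); vectors.shape[1] equals dim
```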

The interesting bit

The project treats vocabulary size as a tunable dial: crank it down to 1,000 and “Stratford” shatters into ['▁str', 'at', 'f', 'ord']; crank it up to 25,000 and the word stays whole. That granularity lets you trade off model size against segmentation fineness without retraining anything. The coverage is also unusually broad: this is one of the few embedding sets that takes languages like Atikamekw and Goan Konkani as seriously as English and Mandarin.
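
To see why a larger vocabulary keeps more words whole, here is a toy Byte-Pair Encoding learner in plain Python. It is a simplified sketch and not BPEmb's or SentencePiece's implementation: the six-word corpus is made up, and the number of merge rules stands in for vocabulary size (each merge adds one symbol to the vocabulary).

```python
from collections import Counter

BOUNDARY = "▁"  # word-start marker, as seen in SentencePiece-style output


def apply_merge(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in `symbols` into one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    corpus = Counter(tuple(BOUNDARY + w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Break frequency ties on the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def segment(word, merges):
    """Split `word` into subwords by replaying the merge rules in order."""
    symbols = tuple(BOUNDARY + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Made-up corpus, for illustration only.
words = ["stratford", "stratford", "strat", "ford", "street", "afford"]

small = segment("stratford", learn_merges(words, 3))   # few merges: several pieces
large = segment("stratford", learn_merges(words, 50))  # many merges: one piece

print(small)
print(large)
```

With only three merges the word is still split into several pieces; once enough merges have been learned, it collapses into the single symbol '▁stratford'. The pre-trained models differ in scale, not in principle: each vocabulary size corresponds to a different stopping point of the same kind of merge process.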

Key highlights

  • 275 languages supported, all trained on Wikipedia corpora
  • Embeddings exposed as gensim KeyedVectors for familiar .most_similar() and vector math
  • SentencePiece models handle segmentation; no custom tokenizer needed
  • Lazy download: models and vectors are fetched on first use, not at install time
  • Multiple vocabulary sizes (1k–200k) and dimensions (25–300) per language
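
For readers unfamiliar with gensim, a similarity query such as .most_similar() boils down to ranking vocabulary entries by cosine similarity to a query vector. The sketch below reproduces that idea in plain Python so it runs without dependencies; the two-dimensional vectors and the subword keys are invented for illustration and are not taken from any BPEmb model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Return the `topn` keys closest to `query`, best match first."""
    q = vectors[query]
    scored = [(key, cosine(q, vec)) for key, vec in vectors.items() if key != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Made-up 2-D embeddings; real BPEmb vectors have 25 to 300 dimensions.
toy_vectors = {
    "▁river": [1.0, 0.1],
    "▁stream": [0.9, 0.2],
    "▁bank": [0.5, 0.5],
    "▁tax": [0.0, 1.0],
}

neighbours = most_similar("▁river", toy_vectors)
print([key for key, _ in neighbours])  # ['▁stream', '▁bank', '▁tax']
```

With a loaded model, the same kind of query would go through the KeyedVectors object instead of a hand-rolled loop.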

Caveats

  • Training data is Wikipedia-only, so domain shift is likely on social media, code, or specialized text
  • The README mentions a MultiBPEmb section, but the visible portion gives no details, so it is unclear whether a jointly trained multilingual model is available
  • No explicit license mentioned in the provided sources

Verdict

Grab this if you need embeddings for a low-resource language fast and don’t have the compute or data to train your own. Skip it if you’re already inside a modern transformer stack: subword tokenization is built in there, and these static vectors won’t beat contextual representations on high-resource languages.
