Is bpemb open source?

Yes — bheinzerling/bpemb is open source, released under the MIT license.

What language is bpemb written in?

bheinzerling/bpemb is primarily written in Python.

How popular is bpemb?

bheinzerling/bpemb has 1.2k stars on GitHub.

Where can I find bpemb?

bheinzerling/bpemb is on GitHub at https://github.com/bheinzerling/bpemb.

bheinzerling/bpemb

275 languages, one embedding model, zero training time

Pre-trained subword embeddings for the long tail of languages that mainstream NLP usually ignores.

★ 1.2k stars · Python · Language Models · Data Tooling

What it does

BPEmb ships pre-trained subword embeddings and Byte-Pair Encoding segmenters for 275 languages, from Abkhazian to Zulu. You pick a language code, vocabulary size, and embedding dimension; the library downloads the matching SentencePiece model and gensim-compatible vectors automatically. It then segments text into subword units and returns numpy arrays ready for a neural network.
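
As a minimal sketch of that workflow, the snippet below uses the library's `BPEmb` class with its `lang`, `vs`, and `dim` arguments and its `encode` and `embed` methods. The helper names `load_model` and `segment_and_embed` are made up for illustration. Loading assumes `pip install bpemb` and network access for the first download, so nothing is loaded until you call `load_model` yourself.

```python
def load_model(lang="en", vs=10000, dim=50):
    """Load a BPEmb model (hypothetical helper; downloads files on first use)."""
    # Imported lazily so the rest of this sketch works without bpemb installed.
    from bpemb import BPEmb
    return BPEmb(lang=lang, vs=vs, dim=dim)


def segment_and_embed(model, text):
    """Return (subword pieces, number of vectors, vector dimension) for `text`.

    `model` is anything with BPEmb-style `encode` and `embed` methods.
    """
    pieces = model.encode(text)   # list of subword strings
    vectors = model.embed(text)   # one vector per subword piece
    return pieces, len(vectors), len(vectors[0])
```

With the package installed, `segment_and_embed(load_model("en"), "Stratford")` should return the pieces along with one `dim`-sized vector per piece.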

The interesting bit

The project treats vocabulary size as a tunable dial: crank it down to 1,000 and “Stratford” shatters into ['▁str', 'at', 'f', 'ord']; crank it to 25,000 and the word stays whole. That granularity lets you trade off model size against segmentation fineness without retraining anything. The coverage is also unusually broad: this is one of the few embedding sets that takes languages like Atikamekw and Goan Konkani as seriously as English and Mandarin.
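
To see why a larger vocabulary yields coarser pieces, here is a toy Byte-Pair Encoding segmenter. It is only a sketch: the merge table `MERGES` is hand-written for this one word and is not the table BPEmb actually learned, and the function `bpe_segment` is a hypothetical name. A bigger vocabulary corresponds to more learned merges, so applying a longer prefix of the table glues more symbols together.

```python
# Hand-written, hypothetical merge rules in priority order (not BPEmb's).
MERGES = [
    ("s", "t"), ("st", "r"), ("▁", "str"), ("a", "t"), ("o", "r"),
    ("or", "d"), ("f", "ord"), ("▁str", "at"), ("▁strat", "ford"),
]


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def bpe_segment(word, merges):
    """Segment a lowercase word by repeatedly applying the best-ranked merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list("▁" + word)  # "▁" marks the word start, SentencePiece-style
    while True:
        candidates = [p for p in zip(symbols, symbols[1:]) if p in ranks]
        if not candidates:
            return symbols
        symbols = merge_pair(symbols, min(candidates, key=ranks.get))


for n_merges in (6, 8, 9):
    print(n_merges, bpe_segment("stratford", MERGES[:n_merges]))
# 6 ['▁str', 'at', 'f', 'ord']
# 8 ['▁strat', 'ford']
# 9 ['▁stratford']
```

In the real library you do not manage merges yourself; you pick a vocabulary size and get the matching pre-trained segmenter.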

Key highlights

275 languages supported, all trained on Wikipedia corpora
Embeddings exposed as gensim KeyedVectors for familiar .most_similar() and vector math
SentencePiece models handle segmentation; no custom tokenizer needed
Lazy download: models and vectors are fetched on first use, not at install time
Multiple vocabulary sizes (1k–200k) and dimensions (25–300) per language
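
For readers unfamiliar with `.most_similar()`, the idea is a cosine-similarity ranking over the vocabulary. The pure-Python sketch below mimics it on a made-up table of 3-dimensional vectors; `TOY_VECTORS` and its values are invented for illustration and are not BPEmb embeddings, and gensim's own implementation is vectorized rather than written like this.

```python
import math

# Invented 3-d vectors for a handful of subword pieces (illustrative only).
TOY_VECTORS = {
    "▁str": [1.0, 0.0, 0.0],
    "▁strat": [0.9, 0.1, 0.0],
    "ford": [0.0, 1.0, 0.0],
    "ord": [0.1, 0.9, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(table, key, topn=3):
    """Rank every other entry in `table` by cosine similarity to `key`."""
    query = table[key]
    scored = [(other, cosine(query, vec))
              for other, vec in table.items() if other != key]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


for piece, score in most_similar(TOY_VECTORS, "ford", topn=2):
    print(f"{piece}\t{score:.3f}")
```

With real BPEmb vectors the same kind of query runs against the gensim KeyedVectors object the library exposes.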

Caveats

Training data is Wikipedia-only, so domain shift is likely on social media, code, or specialized text
The README has a MultiBPEmb section, but the visible portion gives no details, so it is unclear how the multilingual model was jointly trained
The README portion summarized here does not state a license, although the repository is listed as MIT-licensed

Verdict

Grab this if you need embeddings for a low-resource language fast and don’t have the compute or data to train your own. Skip it if you’re already inside a modern transformer stack: subword tokenization is built in, and these static vectors won’t beat contextual representations on high-resource languages.
