rsennrich/subword-nmt

The original BPE toolkit, still shipping on PyPI

Reference implementation of byte-pair encoding for neural MT, now with dropout and glossary support.

★ 2.3k stars · Python · Language Models · Data Tooling

What it does

A set of command-line scripts that chop text into subword units using byte-pair encoding (BPE). You learn merge operations from a training corpus, then apply them to break rare words into manageable pieces for neural models. It also supports character n-gram segmentation and restoring the original text with a sed one-liner.
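The learn-then-apply workflow described above can be sketched in a few lines of Python. This is a toy illustration of the BPE idea, not the repository's code: the helper names (`learn_bpe`, `segment`, `restore`), the tie-breaking rule, and the tiny corpus are all assumptions made for the example. The `@@ ` continuation marker and the regex in `restore` follow the same idea as the sed one-liner mentioned above.

```python
import re
from collections import Counter

END = "</w>"  # end-of-word marker (a convention assumed for this sketch)


def to_symbols(word):
    """Split a word into characters, tagging the last one with END."""
    return tuple(word[:-1]) + (word[-1] + END,)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from an iterable of words."""
    vocab = Counter(to_symbols(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by comparing the pairs themselves (hypothetical rule).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def segment(word, merges):
    """Apply the learned merges in order and mark non-final pieces with '@@'."""
    symbols = to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    pieces = list(symbols)
    pieces[-1] = pieces[-1][: -len(END)]  # drop the end-of-word marker
    return "@@ ".join(pieces)


def restore(line):
    """Undo segmentation by deleting the '@@ ' continuation markers."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


corpus = ("low low low low low lower lower newest newest newest "
          "newest newest newest widest widest widest").split()
merges = learn_bpe(corpus, 10)
segmented = " ".join(segment(w, merges) for w in ["lowest", "newer"])
restored = restore(segmented)
```

Because `restore` only deletes markers, it recovers the input no matter how a word was split, which is why segmentation is safe to apply as a reversible preprocessing step.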

The interesting bit

This is the implementation behind the 2016 ACL paper that popularized BPE for NMT. The repo has since accumulated practical refinements: BPE dropout for data augmentation, glossary regexes to protect entities or tags from being split, and a vocabulary-filtering mode for BPE trained jointly on multiple languages, which keeps each language from being segmented into subwords that are rare or unseen in its own training text.
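To make the glossary idea concrete, here is a minimal sketch, assuming a simplified model in which a token is protected only when a glossary regex matches it in full. The names `make_protected_segmenter` and `char_split` are hypothetical, and `char_split` is a stand-in for a real BPE segmenter; none of this is the toolkit's own API.

```python
import re


def make_protected_segmenter(segment_word, glossary_patterns):
    """Wrap `segment_word` so tokens matching a glossary regex pass through intact."""
    glossary = None
    if glossary_patterns:
        # Combine all patterns into one alternation of non-capturing groups.
        glossary = re.compile("|".join(f"(?:{p})" for p in glossary_patterns))

    def segment_line(line):
        out = []
        for token in line.split():
            if glossary is not None and glossary.fullmatch(token):
                out.append(token)  # protected: emitted unchanged
            else:
                out.append(segment_word(token))
        return " ".join(out)

    return segment_line


def char_split(word):
    """Stand-in segmenter (assumption): split every word into single characters."""
    return "@@ ".join(word)


# Protect markup-like tags and plain numbers (example patterns only).
segment_line = make_protected_segmenter(char_split, [r"<[^>]+>", r"\d+"])
protected = segment_line("see <url> 42 ok")
```

The sketch handles whole tokens only; a glossary match embedded inside a longer token would need to be isolated first.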

Key highlights

learn-bpe / apply-bpe CLI with pip install from PyPI
Joint BPE + vocabulary filtering for multilingual setups
BPE dropout (--dropout 0.1) for training-time regularization
Glossary support via regex to shield tokens from segmentation
Byte-level BPE mode (--bytes) matching GPT-2’s approach
Backward-compatible with pre-0.2 BPE files
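The BPE dropout highlight can be illustrated with a short sketch, assuming the usual formulation in which each candidate merge is skipped with probability `dropout` at every step. The function name, the hand-written merge list, and the omission of an end-of-word marker are simplifications for this example rather than the toolkit's implementation.

```python
import random


def segment_with_dropout(word, merges, dropout=0.0, rng=random):
    """Greedy BPE segmentation where each candidate merge may be randomly skipped."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = learned earlier
    symbols = list(word)
    while len(symbols) > 1:
        # Keep each applicable merge with probability (1 - dropout).
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and (dropout == 0.0 or rng.random() >= dropout)
        ]
        if not candidates:
            break  # nothing left to merge, or everything was dropped this round
        _, i = min(candidates)  # apply the earliest-learned surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hand-written merge list, purely for illustration.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]

deterministic = segment_with_dropout("lower", toy_merges)
fully_dropped = segment_with_dropout("lower", toy_merges, dropout=1.0)

# With 0 < dropout < 1, repeated calls can yield different segmentations.
rng = random.Random(0)
samples = [segment_with_dropout("lower", toy_merges, dropout=0.5, rng=rng)
           for _ in range(5)]
```

With `dropout=0.0` the function reduces to ordinary greedy BPE, so the same merge table serves both training-time sampling and deterministic test-time segmentation.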

Caveats

The README notes that approximating per-batch BPE dropout (as done in the BPE-dropout paper, not the 2016 BPE paper) requires manually copying your training corpus multiple times before segmenting it

Verdict

Worth keeping in your toolkit if you need reproducible, paper-faithful BPE with fine-grained controls. Most practitioners now get BPE from Hugging Face Tokenizers or SentencePiece; use this when you need the reference behavior or glossary protection.

Frequently asked

What is rsennrich/subword-nmt? Reference implementation of byte-pair encoding for neural MT, now with dropout and glossary support.
Is subword-nmt open source? Yes, rsennrich/subword-nmt is open source, released under the MIT license.
What language is subword-nmt written in? rsennrich/subword-nmt is primarily written in Python.
How popular is subword-nmt? rsennrich/subword-nmt has 2.3k stars on GitHub.
Where can I find subword-nmt? rsennrich/subword-nmt is on GitHub at https://github.com/rsennrich/subword-nmt.