The original BPE toolkit that still ships on PyPI
Reference implementation of byte-pair encoding for neural MT, now with dropout and glossary support.

What it does
A set of command-line scripts that chop text into subword units using byte-pair encoding (BPE). You learn merge operations from a training corpus, then apply them to break rare words into manageable pieces for neural models. It also offers character n-gram segmentation, and the original text can be restored with a sed one-liner.
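To make the learn-then-apply idea concrete, here is a minimal Python sketch of the merge-learning loop on a toy vocabulary, plus a regex that undoes the segmentation. It is not the toolkit's code: the function names are made up for illustration, and the restore step assumes the "@@ " continuation marker that the toolkit uses by default.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count adjacent symbol pairs in a {space-separated symbols: frequency} vocab."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every occurrence of the symbol pair into a single symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq for word, freq in vocab.items()}


def learn_merges(vocab, num_merges):
    """Greedily pick the most frequent pair, merge it, and repeat."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


def restore_text(line):
    """Undo segmentation, assuming the default '@@ ' continuation marker."""
    return re.sub(r"(@@ )|(@@ ?$)", "", line)


# Toy vocabulary: words split into characters, with an end-of-word symbol.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges = learn_merges(toy_vocab, 3)
print(merges)
print(restore_text("new@@ est low@@ er"))
```

On this toy vocabulary the three learned merges build up the suffix "est" step by step, which is the kind of frequent unit BPE is meant to discover.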
The interesting bit
This is the implementation behind the 2016 ACL paper that popularized BPE for NMT. The repo has since accumulated practical refinements: BPE dropout for data augmentation, glossary regexes to protect entities or tags from being mangled, and a vocabulary-filtering mode for BPE trained jointly on multiple languages, which keeps a word from being segmented into subwords that were only observed in the other language.
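The glossary idea can be sketched in a few lines of Python. This is a simplified illustration, not the toolkit's implementation: it only protects whole whitespace-separated tokens that match the regex, and both protect_and_segment and the toy segmenter chunk2 are hypothetical names standing in for a real BPE segmenter.

```python
import re


def chunk2(token):
    """Toy stand-in for BPE: split a token into 2-character chunks with '@@' markers."""
    chunks = [token[i:i + 2] for i in range(0, len(token), 2)]
    return [c + "@@" for c in chunks[:-1]] + chunks[-1:]


def protect_and_segment(sentence, glossary_pattern, segment):
    """Segment every token except those that fully match the glossary regex."""
    regex = re.compile(glossary_pattern)
    out = []
    for token in sentence.split():
        if regex.fullmatch(token):
            out.append(token)  # protected: passed through untouched
        else:
            out.extend(segment(token))
    return " ".join(out)


result = protect_and_segment("visit <url> in USA today", r"<\w+>|USA", chunk2)
print(result)
```

Here the tag and the entity survive intact while the surrounding words are split, which is the behavior the glossary option is meant to guarantee.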
Key highlights
- learn-bpe/apply-bpe CLI with pip install from PyPI
- Joint BPE + vocabulary filtering for multilingual setups
- BPE dropout (--dropout 0.1) for training-time regularization
- Glossary support via regex to shield tokens from segmentation
- Byte-level BPE mode (--bytes) matching GPT-2’s approach
- Backward-compatible with pre-0.2 BPE files
Caveats
- The README notes that true per-batch BPE dropout (as in the original BPE-dropout paper) is not built in; to approximate it, you have to copy your training corpus several times so that each sentence gets multiple segmentations
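The corpus-copying workaround for dropout can be sketched as follows. This is an illustrative Python approximation under stated assumptions, not the toolkit's code: segment_with_dropout and augment_corpus are hypothetical names, words are split into plain characters without an end-of-word symbol, and pieces are joined with the "@@ " marker.

```python
import random


def segment_with_dropout(word, merges, dropout, rng):
    """Apply merges in priority order, skipping each candidate with probability `dropout`."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [
            (ranks[pair], pair)
            for pair in zip(symbols, symbols[1:])
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, best = min(candidates)  # earliest-learned surviving merge wins
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def augment_corpus(lines, merges, dropout, copies, seed=0):
    """Segment `copies` passes over the corpus so each sentence is resampled each pass."""
    rng = random.Random(seed)
    out = []
    for _ in range(copies):
        for line in lines:
            words = [
                "@@ ".join(segment_with_dropout(w, merges, dropout, rng))
                for w in line.split()
            ]
            out.append(" ".join(words))
    return out


toy_merges = [("l", "o"), ("lo", "w"), ("e", "r")]
for line in augment_corpus(["lower low"], toy_merges, dropout=0.1, copies=4):
    print(line)
```

With dropout set to 0 every copy is identical, so the extra copies only pay off when dropout is active, and they multiply the size of the preprocessed training data accordingly.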
Verdict
Worth keeping in your toolkit if you need reproducible, paper-faithful BPE with fine-grained controls. Most practitioners now get BPE from Hugging Face Tokenizers or SentencePiece; use this when you need the reference behavior or glossary protection.