rsennrich/subword-nmt

The original BPE toolkit that still ships on PyPI

Reference implementation of byte-pair encoding for neural MT, now with dropout and glossary support.

2.3k stars · Python · Language Models · Data Tooling
Velocity (7d): +0.6 ★/day · Trend: steady

What it does

A set of command-line scripts that chop text into subword units using byte-pair encoding (BPE). You learn merge operations from a training corpus, then apply them to break rare words into manageable pieces for neural models. It also supports character n-gram segmentation and restoring the original text with a sed one-liner.
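
The learn/apply/restore cycle described above can be sketched in a few lines of plain Python. This is a toy illustration and not the repository's code: the function names (`learn_bpe`, `segment`, `restore`), the `</w>` end-of-word marker, and the tie-breaking rule are assumptions made for the example. The `@@ ` separator and the regex in `restore` mirror the sed one-liner mentioned above.

```python
import re
from collections import Counter

EOW = "</w>"  # end-of-word marker (an assumption of this sketch)

def to_symbols(word):
    # Characters of the word, with the marker attached to the last one.
    return tuple(word[:-1]) + (word[-1] + EOW,)

def merge_pair(symbols, pair):
    # Replace each adjacent occurrence of `pair` with the merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(lines, num_merges):
    # Count each whitespace-separated word once per occurrence.
    vocab = Counter(to_symbols(w) for line in lines for w in line.split())
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (hypothetical rule).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    # Replay the learned merges in order, then join pieces with "@@ ".
    symbols = to_symbols(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    pieces = list(symbols[:-1]) + [symbols[-1][:-len(EOW)]]
    return "@@ ".join(pieces)

def restore(line):
    # Same pattern as the sed one-liner; apply it to one line at a time.
    return re.sub(r"(@@ )|(@@ ?$)", "", line)
```

With merges learned from a corpus where "ab" is frequent and "ac" is rare, `segment("ab", merges)` stays whole while `segment("ac", merges)` is split into two pieces, and `restore` undoes either segmentation.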

The interesting bit

This is the implementation behind the 2016 ACL paper that popularized BPE for NMT. The repo has since accumulated practical refinements: BPE dropout for data augmentation, glossary regexes to protect entities or tags from being mangled, and a vocabulary-filtering mode that prevents cross-lingual segmentation weirdness when you train BPE jointly on multiple languages.
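
To make the vocabulary-filtering idea concrete, here is a hedged sketch of one way it can work: any subword that is not in the allowed vocabulary is split back into the two parts it was merged from, recursively, until every piece is either allowed or a single base symbol. The names `split_to_vocab` and `merge_parents` are hypothetical, and the sketch ignores separators and frequency thresholds, so treat it as an illustration of the principle and not as the repository's implementation.

```python
def split_to_vocab(piece, merge_parents, vocab):
    """Undo merges until every returned piece is in `vocab` or cannot be split.

    merge_parents maps a merged symbol to the (left, right) pair it came from.
    """
    if piece in vocab or piece not in merge_parents:
        return [piece]
    left, right = merge_parents[piece]
    return (split_to_vocab(left, merge_parents, vocab)
            + split_to_vocab(right, merge_parents, vocab))

# Hypothetical merge table: "l"+"o" -> "lo", then "lo"+"w" -> "low".
merge_parents = {"lo": ("l", "o"), "low": ("lo", "w")}

# "low" was never seen in this language's vocabulary, but "lo" and "w" were.
pieces = split_to_vocab("low", merge_parents, {"lo", "w"})
```

The point of the design is that a subword learned mostly from one language is never emitted for another language where it is rare or unseen; it falls back to smaller pieces instead.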

Key highlights

  • learn-bpe / apply-bpe CLI with pip install from PyPI
  • Joint BPE + vocabulary filtering for multilingual setups
  • BPE dropout (--dropout 0.1) for training-time regularization
  • Glossary support via regex to shield tokens from segmentation
  • Byte-level BPE mode (--bytes) matching GPT-2’s approach
  • Backward-compatible with pre-0.2 BPE files
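
As an illustration of the glossary highlight, the sketch below passes any token that fully matches a glossary regex through untouched and segments everything else. It is a simplification under stated assumptions: `segment_protected` and `char_split` are hypothetical names, `char_split` is a stand-in for a real BPE segmenter, and matching is done on whole tokens only.

```python
import re

def char_split(token):
    # Stand-in segmenter (hypothetical): split into characters joined by "@@ ".
    return "@@ ".join(token)

def segment_protected(line, segment, glossary_patterns):
    # Combine the glossary regexes; a token is protected if one matches it fully.
    protected = re.compile("|".join(f"(?:{p})" for p in glossary_patterns))
    out = []
    for token in line.split():
        if protected.fullmatch(token):
            out.append(token)           # shielded from segmentation
        else:
            out.append(segment(token))  # segmented as usual
    return " ".join(out)

result = segment_protected("see <b> USA now", char_split, [r"<[^>]+>", "USA"])
```

Here the markup tag and the entity name survive intact while the surrounding words are broken into pieces.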

Caveats

  • The README notes that the BPE-dropout paper applied dropout to each training batch separately; with this toolkit you can only approximate that by copying the training corpus several times, so that each copy gets a different segmentation
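
The corpus-copying workaround for BPE dropout can be sketched as follows. This is a simplified model and not the toolkit's code: merges are replayed in order and each possible merge is skipped with probability `dropout`, there is no end-of-word marker, and the names `segment_with_dropout` and `augment` are hypothetical.

```python
import random

def segment_with_dropout(word, merges, dropout=0.0, rng=random):
    # Replay merges in order, skipping each possible merge with prob. `dropout`.
    symbols = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(symbols):
            can_merge = (i + 1 < len(symbols)
                         and symbols[i] == left and symbols[i + 1] == right)
            if can_merge and rng.random() >= dropout:
                out.append(left + right)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def augment(lines, merges, copies, dropout, seed=0):
    # Emit `copies` passes over the corpus, each with its own random segmentation.
    rng = random.Random(seed)
    out = []
    for _ in range(copies):
        for line in lines:
            words = ("@@ ".join(segment_with_dropout(w, merges, dropout, rng))
                     for w in line.split())
            out.append(" ".join(words))
    return out

merges = [("l", "o"), ("lo", "w")]  # hypothetical merge table
copies = augment(["low low"], merges, copies=3, dropout=0.5)
```

Each of the three output lines decodes to the same sentence but may be segmented differently, which is the effect the copy-the-corpus approach is after. At test time you would segment with `dropout=0.0`.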

Verdict

Worth keeping in your toolkit if you need reproducible, paper-faithful BPE with fine-grained controls. Most practitioners now get BPE from Hugging Face Tokenizers or SentencePiece; use this when you need the reference behavior or glossary protection.
