google/sentencepiece
A C++ and Python library implementing subword tokenization (BPE and unigram language model) for neural text generation models.

Velocity (7d): +3.5 ★/day · Trend: steady
SentencePiece is an unsupervised text tokenizer and detokenizer aimed primarily at neural network-based text generation systems. It implements subword segmentation using byte-pair encoding (BPE) and the unigram language model. Models are trained directly from raw sentences, with no language-specific preprocessing required, which makes a fully end-to-end pipeline possible.
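
To make the idea of training subword units directly from raw sentences more concrete, here is a minimal pure-Python sketch of BPE-style merge learning. It is a toy illustration, not SentencePiece's implementation or API: the function names (`to_symbols`, `learn_merges`, `apply_merge`, `encode`, `decode`) are hypothetical, and it assumes the input text does not already contain the "▁" character. The one detail borrowed from SentencePiece is treating whitespace as an ordinary symbol ("▁", U+2581), which is what lets detokenization recover the original text without language-specific rules.

```python
from collections import Counter

WS = "\u2581"  # "▁": meta symbol standing in for a space


def to_symbols(sentence):
    """Turn a raw sentence into a list of characters, with spaces made explicit."""
    return list(sentence.replace(" ", WS))


def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` in `seq` with the joined piece."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_merges(sentences, num_merges):
    """Greedily learn up to `num_merges` merges of the most frequent adjacent pair."""
    corpus = [to_symbols(s) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break  # every sentence has collapsed to a single piece
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(seq, best) for seq in corpus]
    return merges


def encode(sentence, merges):
    """Segment a raw sentence by replaying the learned merges in order."""
    seq = to_symbols(sentence)
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def decode(pieces):
    """Detokenize: concatenate the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")


if __name__ == "__main__":
    merges = learn_merges(["low lower lowest", "new newer newest"], num_merges=10)
    pieces = encode("lower newest", merges)
    print(pieces)
    assert decode(pieces) == "lower newest"  # round trip is lossless
```

Because spaces are carried inside the pieces, `decode` is just a concatenation followed by a character replacement. The real library differs in many respects (for example vocabulary-size targets, normalization, and the unigram language model alternative), so this sketch should be read only as an illustration of the general technique.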