google/sentencepiece

A C++ and Python library implementing subword tokenization (BPE and unigram language model) for neural text generation models.

11.9k stars · C++ · Data Tooling
sentencepiece
Velocity (7d): +3.5 ★/day · Trend: steady

SentencePiece is an unsupervised text tokenizer and detokenizer, aimed primarily at neural network-based text generation systems. It implements two subword segmentation algorithms: byte-pair encoding (BPE) and the unigram language model. Models are trained directly from raw sentences, with no language-specific pre-tokenization step, which makes fully end-to-end systems possible.
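
To make the idea of training subword units from raw sentences concrete, here is a minimal pure-Python sketch of byte-pair encoding. It is a toy illustration, not the library's implementation or API: the function names (`train_bpe`, `apply_merge`, `encode`, `decode`) are hypothetical, the corpus is made up, and the only detail borrowed from SentencePiece is treating whitespace as an ordinary symbol (written `▁`), so that decoding is a plain string join.

```python
from collections import Counter

SPACE = "\u2581"  # "▁", the meta symbol SentencePiece uses to mark whitespace


def apply_merge(seq, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(sentences, num_merges):
    """Learn up to `num_merges` merge rules from raw sentences (toy BPE)."""
    # No pre-tokenization: each sentence is one character sequence,
    # and spaces are kept as a normal symbol.
    corpus = [list(s.replace(" ", SPACE)) for s in sentences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        # Most frequent adjacent pair; ties broken lexicographically
        # so the result is deterministic.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        corpus = [apply_merge(seq, best) for seq in corpus]
    return merges


def encode(text, merges):
    """Segment `text` by replaying the learned merges in training order."""
    seq = list(text.replace(" ", SPACE))
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def decode(pieces):
    """Detokenize: join the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(SPACE, " ")


if __name__ == "__main__":
    merges = train_bpe(["low lower lowest", "new newer newest"], num_merges=10)
    pieces = encode("lowest newer", merges)
    print(pieces)
    print(decode(pieces))
```

Because whitespace survives as a symbol inside the pieces, `decode(encode(text, merges))` returns the original text for any input that does not already contain `▁`. The unigram language model approach mentioned above works differently (it prunes a large candidate vocabulary instead of merging pairs) and is not shown here.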
