
VKCOM/YouTokenToMe

A fast C++ implementation of Byte Pair Encoding (BPE) tokenization for text preprocessing.

979 stars · C++ · Data Tooling · Velocity (7d): +0.4 ★/day · Trend: steady

YouTokenToMe is an unsupervised text tokenizer implementing Byte Pair Encoding (BPE), a standard subword segmentation algorithm used to prepare text for language models. It is written in C++ with Python bindings for speed, and the project claims speedups of up to 60x over comparable tokenizers such as Hugging Face tokenizers, fastBPE, and SentencePiece. The tool supports multithreading and BPE-dropout regularization, and it provides both Python and command-line interfaces for training tokenizers and encoding text into subword sequences.
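To make the BPE idea concrete, here is a minimal pure-Python sketch of how merge rules can be learned and then applied to new text. It is an illustration of the general algorithm only, not YouTokenToMe's C++ implementation or its API: the function names `apply_merge`, `learn_bpe_merges`, and `encode` are hypothetical, and the tie-breaking rule is an assumption chosen to keep the output deterministic. It also omits BPE-dropout, multithreading, and any special handling of word boundaries.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Each distinct word starts as a tuple of characters, weighted by its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Every word is a single symbol; nothing left to merge.
        # Most frequent pair wins; ties go to the lexicographically largest pair
        # (an arbitrary assumption so that results are reproducible).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the new merged symbol.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment `word` into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)
```

For example, `learn_bpe_merges(["low", "low", "lower", "lowest"], 2)` learns two merges that build the subword `low`, after which `encode("lowly", merges)` splits the unseen word into `low`, `l`, and `y`. A production tokenizer avoids recounting all pairs on every iteration, which is where much of the speed difference between implementations comes from.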
