VKCOM/YouTokenToMe
A fast C++ implementation of Byte Pair Encoding (BPE) tokenization for text preprocessing.

YouTokenToMe is an unsupervised text tokenizer that implements Byte Pair Encoding (BPE), a standard subword algorithm for preparing text for language models. It is written in C++ with Python bindings for speed, and its authors report it to be up to 60 times faster than comparable tokenizers such as Hugging Face tokenizers, fastBPE, and SentencePiece. The tool supports multithreading and BPE-dropout regularization, and it provides both Python and command-line interfaces for training tokenizers and encoding text into subword sequences.
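
To make the BPE idea concrete, here is a minimal pure-Python sketch of the general algorithm: repeatedly merge the most frequent adjacent symbol pair, then apply the learned merges to new words. It is a toy illustration only, not YouTokenToMe's C++ implementation or its API. The function names (train_bpe, merge_pair, encode) and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Each distinct word starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself
        # (an arbitrary choice for this sketch, to keep results deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the corpus with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split a word into subwords by applying the merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

For example, training on ["low", "low", "lower", "lowest"] with two merges learns ("o", "w") and then ("l", "ow"), so encode("lowest", merges) yields ["low", "e", "s", "t"]. A production tokenizer additionally handles word boundaries, unknown characters, and vocabulary IDs, which this sketch omits.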