huggingface/tokenizers
A Rust library providing fast tokenizers for NLP and language model preprocessing.

Provides implementations of today’s most widely used tokenizers with a focus on performance and versatility. The Rust implementation makes both training and tokenization extremely fast: tokenizing a gigabyte of text takes less than 20 seconds on a server CPU. Designed for both research and production use, it handles vocabulary training, text-to-token conversion, normalization with alignment tracking (every token can be mapped back to the span of the original text it came from), and preprocessing tasks such as truncation, padding, and special token insertion. Bindings are available for Python, Node.js, Ruby, and other languages.
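
To make the preprocessing steps above concrete, here is a minimal, standard-library-only Python sketch of alignment tracking, truncation, special token insertion, and padding. It is a toy illustration and not the library's API: `ToyEncoding`, `toy_encode`, the regex-based splitting, the lowercase normalization, and the `[CLS]`/`[SEP]`/`[PAD]` token names are all assumptions chosen for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyEncoding:
    """Hypothetical container for the result; not a class from the library."""
    tokens: List[str]
    offsets: List[Tuple[int, int]]  # spans in the ORIGINAL text; (0, 0) for added tokens
    attention_mask: List[int]       # 1 = real or special token, 0 = padding


def toy_encode(text: str, max_length: int = 8) -> ToyEncoding:
    """Toy pipeline: split, normalize, truncate, add special tokens, pad."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")

    # Split on the original text so every span indexes the original characters,
    # then normalize (lowercase) each piece. The span itself is left untouched,
    # which is what lets a normalized token be traced back to its source.
    pieces = [
        (m.group().lower(), (m.start(), m.end()))
        for m in re.finditer(r"\w+|[^\w\s]", text)
    ]

    # Truncation: reserve two slots for the special tokens.
    pieces = pieces[: max_length - 2]

    # Special token insertion: added tokens have no source span, so use (0, 0).
    tokens = ["[CLS]"] + [tok for tok, _ in pieces] + ["[SEP]"]
    offsets = [(0, 0)] + [span for _, span in pieces] + [(0, 0)]
    attention_mask = [1] * len(tokens)

    # Padding: fill up to max_length and mask the padded positions out.
    pad = max_length - len(tokens)
    tokens += ["[PAD]"] * pad
    offsets += [(0, 0)] * pad
    attention_mask += [0] * pad

    return ToyEncoding(tokens, offsets, attention_mask)


text = "Hello, World!"
enc = toy_encode(text, max_length=8)
start, end = enc.offsets[3]
print(enc.tokens[3], "->", text[start:end])  # prints "world -> World"
```

The offsets are what make the alignment useful: the token `world` was lowercased during normalization, yet its span `(7, 12)` still recovers `World` from the input. A real tokenizer would replace the regex split with a trained subword model, but the bookkeeping follows the same idea.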